<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Elastic Observability Labs - Kubernetes</title>
        <link>https://www.elastic.co/observability-labs</link>
        <description>Trusted observability news &amp; research from the team at Elastic.</description>
        <lastBuildDate>Mon, 16 Mar 2026 06:34:53 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>Elastic Observability Labs - Kubernetes</title>
            <url>https://www.elastic.co/observability-labs/assets/observability-labs-thumbnail.png</url>
            <link>https://www.elastic.co/observability-labs</link>
        </image>
        <copyright>© 2026. Elasticsearch B.V. All Rights Reserved</copyright>
        <item>
            <title><![CDATA[Bringing Your Cloud-Managed Kubernetes Audit Logs into Elasticsearch]]></title>
            <link>https://www.elastic.co/observability-labs/blog/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch</link>
            <guid isPermaLink="false">bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch</guid>
            <pubDate>Mon, 19 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[How to bring your Cloud-Managed Kubernetes Audit Logs into Elasticsearch]]></description>
            <content:encoded><![CDATA[<h2>Introduction</h2>
<p>Kubernetes audit logs are essential for ensuring the security, compliance, and transparency of Kubernetes clusters. However, with managed Kubernetes infrastructure, traditional audit file-based log shipping is often not supported, and audit logs are only available via the control plane API or the Cloud Provider logging facility. In this blog, we will show you how to ingest the audit logs from these other sources and still take advantage of the <a href="https://www.elastic.co/docs/current/integrations/kubernetes/audit-logs">Elastic Kubernetes Audit Log Integration</a>.</p>
<p>In this blog, we will focus on AWS as our cloud provider. When ingesting logs from AWS, you have several options:</p>
<ul>
<li><a href="https://www.elastic.co/docs/current/integrations/aws_logs">AWS Custom Logs integration</a> (which we will utilize in this blog)</li>
<li><a href="https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics">AWS Firehose</a> to send logs from Cloudwatch to Elastic</li>
<li><a href="https://www.elastic.co/docs/current/integrations/aws">AWS General integration</a> which supports many AWS sources</li>
</ul>
<p>In part 1 of this two-part series, we will focus on properly ingesting Kubernetes audit logs, and part 2 will focus on investigation, analytics, and alerting.</p>
<p>Kubernetes auditing <a href="https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/">documentation</a> describes the need for auditing in order to get answers to the questions below:</p>
<ul>
<li>What happened?</li>
<li>When did it happen?</li>
<li>Who initiated it?</li>
<li>What resource did it occur on?</li>
<li>Where was it observed?</li>
<li>From where was it initiated (Source IP)?</li>
<li>Where was it going (Destination IP)?</li>
</ul>
<p>Answers to the above questions become important when an incident occurs and an investigation follows. Alternatively, it could just be a log retention use case for a regulated company trying to fulfill compliance requirements. </p>
<p>We are giving special importance to audit logs in Kubernetes because they are not enabled by default. Audit logs can consume a significant amount of memory and storage, so enabling them is usually a balance between retaining and investigating audit data on the one hand, and giving up resources otherwise budgeted for workloads on the cluster on the other. Another reason to pay attention to Kubernetes audit logs is that, unlike regular container logs, once turned on they are written to the cloud provider’s logging service. This is true for most cloud providers because they manage the Kubernetes control plane, so it is natural for them to route control-plane output through their own logging framework.</p>
<p>Kubernetes audit logs can be quite verbose by default. Hence, it becomes important to selectively choose how much logging is done so that the organization's audit requirements are met without excess volume. This is configured in the <a href="https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/#audit-policy">audit policy</a> file, which is submitted to the <code>kube-apiserver</code>. Not all flavors of cloud-provider-hosted Kubernetes clusters let you work with the <code>kube-apiserver</code> directly. For example, AWS EKS allows this <a href="https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html">logging</a> to be configured only through the control plane.</p>
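<p>Where you do control the <code>kube-apiserver</code>, the audit policy is a small YAML document. As a point of reference, a minimal policy might look like the sketch below; the rule choices are illustrative, not a recommendation:</p>

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
# Skip the RequestReceived stage to reduce volume.
omitStages:
  - "RequestReceived"
rules:
  # Capture full request and response bodies for Secret changes.
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["secrets"]
  # For everything else, record only metadata.
  - level: Metadata
```

<p>On EKS you do not submit this file yourself; the control plane manages the audit configuration for you.</p>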
<p><strong>In this blog we will be using Amazon Elastic Kubernetes Service (Amazon EKS), with the Kubernetes audit logs that are automatically shipped to AWS CloudWatch.</strong></p>
<p>A sample audit log for a secret named “empty-secret”, created by an admin user on EKS, is logged in AWS CloudWatch in the following format:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-clougwatch-logs.png" alt="Alt text" /></p>
<p>Once the audit logs show up on CloudWatch, it is time to consider how to transfer them to Elasticsearch. Elasticsearch is a great platform for creating dashboards that visualize different audit events recorded in a Kubernetes cluster. It is also a powerful tool for analyzing various audit events. For example, how many secret object creation attempts were made in an hour? </p>
<p>Now that we have established that the Kubernetes audit logs are being written to CloudWatch, let’s discuss how to get them ingested into Elasticsearch. Elasticsearch has an integration to consume logs written to CloudWatch. Using this integration by default will ingest the CloudWatch JSON as is, i.e., the real audit log JSON stays nested inside the wrapper CloudWatch JSON. When bringing logs into Elasticsearch, it is important to use the <a href="https://www.elastic.co/guide/en/ecs/current/index.html">Elastic Common Schema</a> (ECS) to get the best search and analytics performance. This means there needs to be an ingest pipeline that parses a standard Kubernetes audit JSON message and creates an ECS-compliant document in Elasticsearch. Let’s dive into how to achieve this.</p>
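<p>To make the nesting concrete, here is a small Python sketch of the document shape involved. The field names and values below are invented for illustration; the extraction step is essentially what the <code>json</code> ingest processor will do for us later:</p>

```python
import json

# A simplified CloudWatch-wrapped document: the real Kubernetes audit
# event arrives as a JSON *string* inside the "message" field.
cloudwatch_doc = {
    "awscloudwatch.log_group": "/aws/eks/demo-cluster/cluster",  # illustrative
    "message": json.dumps({
        "kind": "Event",
        "apiVersion": "audit.k8s.io/v1",
        "verb": "create",
        "objectRef": {"resource": "secrets", "name": "empty-secret"},
    }),
}

# What the pipeline's json processor effectively does: parse the string
# and attach the result under kubernetes.audit.
audit = json.loads(cloudwatch_doc["message"])
print(audit["objectRef"]["resource"])  # secrets
```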
<p>Elastic has a Kubernetes integration that uses Elastic Agent to consume Kubernetes container logs from the console and audit logs written to a file path. For a cloud-provider use case, as described above, it may not be feasible to write audit logs to a path on the Kubernetes cluster. So, how do we leverage the <a href="https://github.com/elastic/integrations/blob/main/packages/kubernetes/data_stream/audit_logs/fields/fields.yml">ECS fields designed for parsing Kubernetes audit logs</a>, already implemented in the Kubernetes integration, for the CloudWatch audit logs? That is the most exciting plumbing piece! Let’s see how to do it.</p>
<h3>Here’s what we’re going to do:</h3>
<ul>
<li>
<p>Read the Kubernetes audit logs from the cloud provider’s logging module (in our case, AWS CloudWatch, since this is where the logs reside). We will use Elastic Agent and the <a href="https://www.elastic.co/docs/current/integrations/aws_logs">Elasticsearch AWS Custom Logs integration</a> to read the logs from CloudWatch. <strong>Note:</strong> please be aware that there are several Elastic AWS integrations; we are specifically using the AWS Custom Logs integration.</p>
</li>
<li>
<p>Create two simple ingest pipelines (we do this for best practices of isolation and composability) </p>
</li>
<li>
<p>The first pipeline looks for Kubernetes audit JSON messages and then redirects them to the second pipeline</p>
</li>
<li>
<p>The second custom pipeline will associate the JSON <code>message</code> field with the field expected by the Elasticsearch-managed Kubernetes audit pipeline (aka the integration) and then <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html"><code>reroute</code></a> the message to the correct data stream, <code>kubernetes.audit_logs-default</code>, which in turn applies all the proper mappings and ingest pipelines to the incoming message</p>
</li>
<li>
<p>The overall flow will be:</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/overall-ingestion-flow.png" alt="Alt text" /></p>
<h3>1. Create the AWS Custom Logs integration</h3>
<p>a.  Populate the AWS access key and secret pair values</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-custom-logs-integration-1.png" alt="Alt text" /></p>
<p>b. In the logs section, populate the log ARN and tags (and enable “Preserve original event” if you want to), then save the integration and exit the page</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-custom-logs-integration-2.png" alt="Alt text" /></p>
<h3>2. Next, we will configure the custom ingest pipeline</h3>
<p>We are doing this because we want to customize what the generic managed pipeline does. We find the name for the custom pipeline by looking at the managed pipeline that is created as an asset when the AWS Custom Logs integration is installed. In this case we will be adding the custom ingest pipeline <code>logs-aws_logs.generic@custom</code></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-logs-index-management.png" alt="Alt text" /></p>
<p>From the Dev Tools console, run the commands below. Here, we are extracting the <code>message</code> field from the CloudWatch JSON and putting the value in a field called <code>kubernetes.audit</code>. Then, we are rerouting the message to the default Kubernetes audit dataset that comes with the Kubernetes integration</p>
<pre><code>PUT _ingest/pipeline/logs-aws_logs.generic@custom
{
    &quot;processors&quot;: [
      {
        &quot;pipeline&quot;: {
          &quot;if&quot;: &quot;ctx.message != null &amp;&amp; ctx.message.contains('audit.k8s.io')&quot;,
          &quot;name&quot;: &quot;logs-aws-process-k8s-audit&quot;
        }
      }
    ]
}

PUT _ingest/pipeline/logs-aws-process-k8s-audit
{
  &quot;processors&quot;: [
    {
      &quot;json&quot;: {
        &quot;field&quot;: &quot;message&quot;,
        &quot;target_field&quot;: &quot;kubernetes.audit&quot;
      }
    },
    {
      &quot;remove&quot;: {
        &quot;field&quot;: &quot;message&quot;
      }
    },
    {
      &quot;reroute&quot;: {
        &quot;dataset&quot;: &quot;kubernetes.audit_logs&quot;,
        &quot;namespace&quot;: &quot;default&quot;
      }
    }
  ]
}
</code></pre>
<p>Let’s understand this further:</p>
<ul>
<li>
<p>When we create a Kubernetes integration, we get a managed index template called <code>logs-kubernetes.audit_logs</code> that writes to the pipeline called <code>logs-kubernetes.audit_logs-1.62.2</code> by default</p>
</li>
<li>
<p>If we look into the pipeline <code>logs-kubernetes.audit_logs-1.62.2</code>, we see that all the processor logic works against the field <code>kubernetes.audit</code>. This is why the json processor in the above code snippet creates a field called <code>kubernetes.audit</code> before dropping the original <em>message</em> field and rerouting. Rerouting is directed to the <code>kubernetes.audit_logs</code> dataset that backs the <code>logs-kubernetes.audit_logs-1.62.2</code> pipeline (the dataset name is derived from the pipeline naming convention, which follows the format <code>logs-&lt;dataset&gt;-&lt;version&gt;</code>)</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/ingest-pipelines.png" alt="Alt text" /></p>
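<p>The naming convention can be shown in a few lines of Python; <code>dataset_from_pipeline</code> is a hypothetical helper for illustration, not part of any Elastic API:</p>

```python
# Managed pipeline names follow logs-<dataset>-<version>; reroute targets
# the data stream logs-<dataset>-<namespace>.
def dataset_from_pipeline(pipeline_name: str) -> str:
    _prefix, _, rest = pipeline_name.partition("-")   # strip the "logs" prefix
    dataset, _, _version = rest.rpartition("-")       # strip the version suffix
    return dataset

pipeline = "logs-kubernetes.audit_logs-1.62.2"
dataset = dataset_from_pipeline(pipeline)
print(dataset)                    # kubernetes.audit_logs
print(f"logs-{dataset}-default")  # the data stream the reroute targets
```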
<h3>3. Now let’s verify that the logs are actually flowing through and the audit message is being parsed</h3>
<p>a. We will use Elastic Agent and enroll using Fleet and the integration policy we created in Step 1. There are a number of ways to <a href="https://www.elastic.co/guide/en/fleet/current/install-fleet-managed-elastic-agent.html">deploy Elastic Agent</a>; for this exercise we will deploy using Docker, which is quick and easy.</p>
<pre><code>% docker run --env FLEET_ENROLL=1 --env FLEET_URL=&lt;&lt;fleet_URL&gt;&gt; --env FLEET_ENROLLMENT_TOKEN=&lt;&lt;fleet_enrollment_token&gt;&gt;  --rm docker.elastic.co/beats/elastic-agent:8.19.12
</code></pre>
<p>b. Check the messages in Discover. In 8.15 there is also a new feature called Logs Explorer, which provides the ability to see Kubernetes audit logs (and container logs) with a few clicks (see image below). Voila! We can see the Kubernetes audit messages parsed!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/discover.jpg" alt="Alt text" /></p>
<h3>4. Let's do a quick recap of what we did</h3>
<p>We configured the AWS Custom Logs integration in Elasticsearch to read Kubernetes audit logs from CloudWatch. Then, we created custom ingest pipelines to reroute the audit messages to the correct data stream and apply all the OOTB mappings and parsing that come with the Kubernetes Audit Logs integration.</p>
<p>In the next part, we’ll look at how to analyze the ingested Kubernetes Audit log data.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Debugging Azure Networking for Elastic Cloud Serverless]]></title>
            <link>https://www.elastic.co/observability-labs/blog/debugging-aks-packet-loss</link>
            <guid isPermaLink="false">debugging-aks-packet-loss</guid>
            <pubDate>Thu, 05 Jun 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how Elastic SREs uncovered and resolved unexpected packet loss in Azure Kubernetes Service (AKS), impacting Elastic Cloud Serverless performance.]]></description>
            <content:encoded><![CDATA[<h2>Summary of Findings</h2>
<p>Elastic's Site Reliability Engineering team (SRE) observed unstable throughput and packet loss in Elastic Cloud Serverless running on Azure Kubernetes Service (AKS). After investigation, we identified the primary contributing factors to be RX ring buffer overflows and kernel input queue saturation on SR-IOV interfaces. To address this, we increased RX buffer sizes and adjusted the netdev backlog, which significantly improved network stability.</p>
<h2>Setting the Scene</h2>
<p><a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> is a fully managed solution that allows you to deploy and use Elastic for your use cases without managing the underlying infrastructure. Built on Kubernetes, it represents a shift in how you interact with Elasticsearch. Instead of managing clusters, nodes, data tiers, and scaling, you create serverless projects that are fully managed and automatically scaled by Elastic. This abstraction of infrastructure decisions allows you to focus solely on gaining value and insight from your data.</p>
<p>Elastic Cloud Serverless is generally available (GA) on AWS and GCP, and currently in <a href="https://www.elastic.co/guide/en/serverless/current/regions.html">Technical Preview on Azure</a>. As part of preparing Elastic Cloud Serverless for GA on Azure, we have been conducting extensive performance and scalability tests to ensure that our users get a consistent and reliable experience.</p>
<p>In this post, we’ll take you behind the scenes of a deep technical investigation into a surprising performance issue that affected Serverless Elasticsearch in our Azure Kubernetes clusters. At first, the network seemed like the least likely place to look, especially with a high-speed 100 Gb/s interface on the host backing it. But as we dug deeper, with help from the Microsoft Azure team, that’s exactly where the problem led us.</p>
<h2>Unexpected Results!</h2>
<p>While the high-level architectures and system design patterns of the major cloud providers’ systems are often similar, the implementations differ, and these differences can have dramatic impacts on a system’s performance characteristics.</p>
<p>One of the most significant differences between the different cloud providers is that the underlying hypervisor software and server hardware of the Virtual Machines can vary significantly, even between instance families of the same provider.</p>
<p>There is no way to fully abstract the hardware away from an application like Elasticsearch. Fundamentally, its performance is dictated by the CPU, memory, disks, and network interfaces on the physical server. In preparation for the Elastic Cloud Serverless GA on Azure, our Elasticsearch Performance team kicked off large-scale load testing against Serverless Elasticsearch projects running on <a href="https://docs.azure.cn/en-us/aks/what-is-aks">Azure Kubernetes Service (AKS)</a>, using <a href="https://azure.microsoft.com/en-us/blog/azure-cobalt-100-based-virtual-machines-are-now-generally-available/">ARM-based VMs</a> (we’re big fans!). Throughout this process, we relied heavily on Elastic tools to analyse system behaviour, identify bottlenecks, and validate performance under load.</p>
<p>To perform these scale and load tests, the Elasticsearch Performance team use <a href="https://github.com/elastic/rally">Rally</a>, an open-source benchmarking tool designed to measure the performance of Elasticsearch clusters. The workload (or in Rally nomenclature, ‘Track’) used for these tests was the <a href="https://github.com/elastic/rally-tracks/tree/master/github_archive">GitHub Archive Track</a>. Rally collects and sends test telemetry using the <a href="https://www.elastic.co/docs/reference/elasticsearch/clients/python">official Python client</a> to a separate Elasticsearch cluster running <a href="https://www.elastic.co/observability">Elastic Observability</a>, which allows for monitoring and analysis during these scale and load tests in real time via <a href="https://www.elastic.co/docs/explore-analyze">Kibana</a>.</p>
<p>When we looked at the results, we observed that the indexing rate (the number of docs/s) for the Serverless projects was not only much lower than we had expected for the given hardware, but the throughput was also quite unstable. There were peaks and valleys, interspersed with frequent errors, whereas we were instead expecting a stable indexing rate for the duration of the test.</p>
<p>These tests are designed to push the system to its limits, and in doing so, they surfaced unexpected behavior in the form of unstable indexing throughput and intermittent errors. This was precisely the kind of problem we'd hoped to uncover prior to going GA — giving us the opportunity to work closely with Azure.</p>
<div align="center">
<img src="/assets/images/debugging-aks-packet-loss/indexing-rate-before.png" alt="Indexing Rate with Packet Loss" />
<p><em>A Kibana visualisation of Rally telemetry, showing fluctuating Elasticsearch indexing rates alongside spikes in 5xx and 4xx HTTP error responses.</em></p>
</div>
<h2>Debugging!</h2>
<p>Debugging performance issues can feel a little bit like trying to find a <a href="https://www.youtube.com/watch?v=7AO4wz6gI3Q">‘Butterfly in a Hurricane’</a>, so it’s crucial that you take a methodical approach to analysing application and system performance.</p>
<p>Using methodologies helps you to be more consistent and thorough in your debugging, and avoids missing things. We started with the <a href="https://www.brendangregg.com/usemethod.html">Utilisation Saturation and Errors (USE) Method</a>, looking at both the client and server side to identify any obvious bottlenecks in the system.</p>
<p>Elastic's Site Reliability Engineers (SREs) maintain a suite of custom <a href="https://www.elastic.co/docs/solutions/observability/get-started/what-is-elastic-observability">Elastic Observability</a> dashboards designed to visualise data collected from various <a href="https://www.elastic.co/docs/extend/integrations/what-is-an-integration">Elastic Integrations</a>. These dashboards provide deep visibility into the health and performance of Elastic Cloud infrastructure and systems.</p>
<p>For this investigation, we leveraged a custom dashboard built using metrics and log data from the <a href="https://www.elastic.co/docs/reference/integrations/system">System</a> and <a href="https://www.elastic.co/docs/reference/integrations/linux">Linux</a> Integrations:</p>
<div align="center">
<img src="/assets/images/debugging-aks-packet-loss/overview-dashboard.png" alt="Node Overview Dashboard" />
<p><em>One of many Elastic Observability dashboards built and maintained by the SRE team.</em></p>
</div>
<p>Following the USE Method, these dashboards highlight resource utilisation, saturation, and errors across our systems. With their help, we quickly identified that the AKS nodes hosting the Elasticsearch pods under test were dropping thousands of packets per second.</p>
<div align="center">
<img src="/assets/images/debugging-aks-packet-loss/packet-loss-before.png" alt="Node Packet Loss Before Tuning" />
<p><em>A Kibana visualisation of <a href="https://www.elastic.co/docs/reference/integrations/system">Elastic Agent's System Integration</a>, showing the rate of packet drops per second for AKS nodes.</em></p>
</div>
<p>Dropping packets forces reliable protocols, such as TCP, to retransmit any missing packets. These retransmissions can introduce significant delays, which kills the throughput of any system where client requests are only triggered upon the previous request completion (known as a <a href="https://www.usenix.org/legacy/event/nsdi06/tech/full_papers/schroeder/schroeder.pdf">Closed System</a>).</p>
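<p>A back-of-the-envelope model shows how hard retransmissions hit a closed system. The numbers below are purely illustrative, not measurements from our tests: a request normally completes in 2 ms, a retransmission timeout adds 200 ms, and 1% of requests hit one.</p>

```python
# Closed-system model: one outstanding request at a time, so throughput
# is simply 1 / mean request latency.
base_latency_s = 0.002   # normal completion time (illustrative)
rto_penalty_s = 0.200    # extra delay when a packet must be retransmitted
loss_rate = 0.01         # fraction of requests that hit a retransmission

mean_latency = base_latency_s + loss_rate * rto_penalty_s
throughput_clean = 1 / base_latency_s   # requests/s with no loss
throughput_lossy = 1 / mean_latency     # requests/s with 1% loss

print(f"no loss: {throughput_clean:.0f} req/s")  # 500 req/s
print(f"1% loss: {throughput_lossy:.0f} req/s")  # 250 req/s
```

<p>Even a 1% loss rate halves throughput in this toy model, which is why the drops mattered so much despite being a tiny fraction of total packets.</p>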
<p>To investigate further, we jumped onto one of the AKS nodes exhibiting the packet loss to check the basics. First off, we wanted to identify what type of packet drops or errors we’re seeing; is it for specific pods, or the host as a whole?</p>
<pre><code>root@aks-k8s-node-1:~# ip -s link show
2: eth0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 7c:1e:52:be:ce:5e brd ff:ff:ff:ff:ff:ff
    RX:    bytes   packets errors dropped  missed   mcast
    373507935420 134292481      0       0       0      15
    TX:    bytes   packets errors dropped carrier collsns
    644247778936 303191014      0       0       0       0
3: enP42266s1: &lt;BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP&gt; mtu 1500 qdisc mq master eth0 state UP mode DEFAULT group default qlen 1000
    link/ether 7c:1e:52:be:ce:5e brd ff:ff:ff:ff:ff:ff
    RX:    bytes   packets errors dropped  missed   mcast
    386782548951 307000571      0       0 5321081       0
    TX:    bytes   packets errors dropped carrier collsns
    655758630548 477594747      0       0       0       0
    altname enP42266p0s2
15: lxc0ca0ec41ecd2@if14: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether f6:f5:5e:c9:4e:fb brd ff:ff:ff:ff:ff:ff link-netns cni-3f90ab53-df66-cac5-bd19-9cea4a68c29b
    RX:    bytes   packets errors dropped  missed   mcast
    627954576078  54297550      0    1600       0       0
    TX:    bytes   packets errors dropped carrier collsns
    372155326349 133538064      0    3927       0       0
</code></pre>
<p>In this output you can see the <code>enP42266s1</code> interface is showing a significant number of packets in the <code>missed</code> column. That’s interesting, sure, but what does missed actually represent? And what is <code>enP42266s1</code>?</p>
<p>To understand, let’s look at roughly what happens when a packet arrives at the NIC:</p>
<ol>
<li>A packet arrives at the NIC from the network.</li>
<li>The NIC uses DMA (Direct Memory Access) to place the packet into a receive ring buffer allocated in memory by the kernel and mapped for use by the NIC. Since our NICs support multiple hardware queues, each queue has its own dedicated ring buffer, IRQ, and NAPI context.</li>
<li>The NIC raises a hardware interrupt (IRQ) to notify the CPU that a packet is ready.</li>
<li>The CPU runs the NIC driver’s IRQ handler. The driver schedules a NAPI (New API) poll to defer packet processing to a softirq context, a mechanism in the Linux kernel that moves work out of the hard IRQ context for better batching and CPU efficiency, enabling improved scalability.</li>
<li>The NAPI poll function is executed in a softirq context (<code>NET_RX_SOFTIRQ</code>) and retrieves packets from the ring buffer. This polling continues either until the driver’s packet budget is exhausted (<code>net.core.netdev_budget</code>) or the time limit is hit (<code>net.core.netdev_budget_usecs</code>).</li>
<li>Each packet is wrapped in an <code>sk_buff</code> (socket buffer) structure, which includes metadata such as protocol headers, timestamps, and interface identifiers.</li>
<li>If the networking stack is slower than the rate at which NAPI fetches packets, excess packets are queued in a per-CPU backlog queue (via <code>enqueue_to_backlog</code>). The maximum size of this backlog is controlled by the <code>net.core.netdev_max_backlog</code> sysctl.</li>
<li>Packets are then handed off to the kernel’s networking stack for routing, filtering, and protocol-specific processing (e.g. TCP, UDP).</li>
<li>Finally, packets reach the appropriate socket receive buffer, where they are available for consumption by the user-space application.</li>
</ol>
<p>Visualised, it looks something like this:</p>
<div align="center">
<img src="/assets/images/debugging-aks-packet-loss/packet-flow.png" alt="Linux Packet Flow Diagram" />
<p><em>Image © 2018 Leandro Moreira. Used under the <a href="https://opensource.org/licenses/BSD-3-Clause">BSD 3-Clause License</a>. Source: <a href="https://github.com/leandromoreira/linux-network-performance-parameters">GitHub repository</a>.</em></p>
</div>
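<p>Step 7 above (the per-CPU backlog) can be sketched as a bounded queue. This is a toy model, not kernel code; the limit of 1000 matches the common <code>net.core.netdev_max_backlog</code> default:</p>

```python
from collections import deque

def deliver(packets: int, max_backlog: int) -> int:
    """Toy model of enqueue_to_backlog: count how many packets are dropped
    when `packets` arrive in one burst before the stack drains the queue."""
    backlog = deque()
    dropped = 0
    for _ in range(packets):
        if len(backlog) >= max_backlog:
            dropped += 1          # queue full: the kernel bumps a drop counter
        else:
            backlog.append("pkt")
    return dropped

# A 2000-packet burst against a backlog limit of 1000:
print(deliver(packets=2000, max_backlog=1000))  # 1000
print(deliver(packets=500, max_backlog=1000))   # 0
```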
<p>The <code>missed</code> counter is incremented whenever the NIC tries to DMA a packet into a fully occupied <a href="https://en.wikipedia.org/wiki/Circular_buffer">ring buffer</a>. The NIC essentially &quot;misses&quot; the chance to deliver the packet to the VM’s memory. However, what’s most interesting is that this counter seldom increments for VMs. This is because Virtual NICs are usually implemented as software via the hypervisor, which typically has much more flexible memory management compared to the physical NICs and can reduce the chance of ring buffer overflow.</p>
<p>We mentioned earlier that we’re building Azure Elasticsearch Serverless on top of Azure’s AKS service, which is important to note because all of our AKS nodes use an Azure feature called <a href="https://learn.microsoft.com/en-us/azure/virtual-network/accelerated-networking-overview">Accelerated Networking</a>. In this setup, network traffic is delivered directly to the VM’s network interface, bypassing the hypervisor. This is enabled by <a href="https://learn.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-single-root-i-o-virtualization--sr-iov-">single root I/O virtualization (SR-IOV)</a>, which offers much lower latency and higher throughput than traditional VM networking. Each node is physically connected to a 100 Gb/s network interface, although the SR-IOV Virtual Function (VF) exposed to the VM typically provides only a fraction of that total bandwidth.</p>
<p>Despite the VM only having a fraction of the 100 Gb/s bandwidth, microbursts are still very possible. These physical interfaces are so fast that they can transmit and receive multiple packets in just nanoseconds, far faster than most buffers or processing queues can absorb. At these timescales, even a short-lived burst of traffic can overwhelm the receiver, leading to dropped packets and unpredictable latency.</p>
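<p>A toy model makes the effect of ring sizing on microbursts visible; the burst size and drain figure below are invented for illustration:</p>

```python
def missed_packets(burst: int, ring_size: int, drained_during_burst: int) -> int:
    """Toy model: the NIC DMA-writes `burst` packets while the host frees
    only `drained_during_burst` descriptors; any overflow becomes `missed`."""
    free_slots = ring_size + drained_during_burst
    return max(0, burst - free_slots)

burst = 6000    # packets in one microburst (illustrative)
drained = 100   # descriptors freed while the burst is in flight

print(missed_packets(burst, 1024, drained))  # 4876 missed with RX=1024
print(missed_packets(burst, 8192, drained))  # 0 missed with RX=8192
```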
<p>Direct access to the SR-IOV interface means that our VMs are responsible for handling the hardware interrupts triggered by the NIC in a timely manner; if there's any delay in handling the hardware interrupt (e.g. waiting to be scheduled onto a CPU by the hypervisor), then network packets can be missed!</p>
<h2>Firstly - NIC-level Tuning</h2>
<p>Since we'd confirmed that our VMs were using SR-IOV, we established that the <code>enP42266s1</code> and <code>eth0</code> interfaces <a href="https://learn.microsoft.com/en-us/azure/virtual-network/accelerated-networking-how-it-works">were a bonded pair and acted as a single interface</a>. Knowing this, we reasoned that we should be able to adjust the ring buffer values directly using <code>ethtool</code>.</p>
<pre><code>root@aks-k8s-node-1:~# ethtool -g enP42266s1
Ring parameters for enP42266s1:
Pre-set maximums:
RX:		8192
RX Mini:	n/a
RX Jumbo:	n/a
TX:		8192
Current hardware settings:
RX:		1024
RX Mini:	n/a
RX Jumbo:	n/a
TX:		1024
</code></pre>
<p>In the output above, we were using only 1/8th of the available ring buffer descriptors. These values were set by the OS defaults, which generally aim to balance performance and resource usage. Set too low, they risk packet drops under load; set too high, they can lead to unnecessary memory consumption. We knew that the VMs were backed by a virtual function carved out of the directly attached 100 Gb/s network interface, which is fast enough to deliver microbursts that could easily overwhelm small buffers. To better absorb those short, high-intensity bursts of traffic, we increased the NIC’s RX ring buffer size from 1024 to 8192. Using a privileged DaemonSet, we rolled out the change across all of our AKS nodes by installing <a href="https://en.wikipedia.org/wiki/Udev">a <code>udev</code> rule</a> to automatically increase the buffer size:</p>
<pre><code># Match Mellanox ConnectX network cards and run ethtool to update the ring buffer settings
ENV{INTERFACE}==&quot;en*&quot;, ENV{ID_NET_DRIVER}==&quot;mlx5_core&quot;, RUN+=&quot;/sbin/ethtool -G %k rx ${CONFIG_AZURE_MLX_RING_BUFFER_SIZE} tx ${CONFIG_AZURE_MLX_RING_BUFFER_SIZE}&quot;
</code></pre>
<div align="center">
<img src="/assets/images/debugging-aks-packet-loss/packet-loss-after.png" alt="AKS Node Packet Loss after RX ring buffer change" />
<p><em>A Kibana visualisation of <a href="https://www.elastic.co/docs/reference/integrations/system">Elastic Agent's System Integration</a>, showing packet loss reduced by ~99% after increasing the NIC's RX ring buffer values.</em></p>
</div>
<p>As soon as the change had been applied to all AKS nodes, we stopped ‘missing’ RX packets! Fantastic! As a result of this simple change, we observed a significant improvement in our indexing throughput and stability.</p>
<div align="center">
<img src="/assets/images/debugging-aks-packet-loss/indexing-rate-after.png" alt="Indexing rate after RX ring buffer change" />
<p><em>A Kibana visualisation of Rally telemetry, showing stable and improved Elasticsearch indexing rates after increasing the RX ring buffer size.</em></p>
</div>
<p>Job done, right? Not quite…</p>
<h2>Further improvements - Kernel-level Tuning</h2>
<p>Eagle eyed readers may have noticed two things:</p>
<ol>
<li>In the previous screenshot, despite adjusting the physical RX ring buffer values, we still observed a small number of <code>dropped</code> packets on the TX side.</li>
<li>In the original <code>ip -s link show</code> output, one of the ‘logical’ interfaces used by the Elasticsearch pod was showing <code>dropped</code> packets on both the TX and RX sides.</li>
</ol>
<pre><code>15: lxc0ca0ec41ecd2@if14: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether f6:f5:5e:c9:4e:fb brd ff:ff:ff:ff:ff:ff link-netns cni-3f90ab53-df66-cac5-bd19-9cea4a68c29b
    RX:    bytes   packets errors dropped  missed   mcast
    627954576078  54297550      0    1600       0       0
    TX:    bytes   packets errors dropped carrier collsns
    372155326349 133538064      0    3927       0       0
</code></pre>
<p>So, we continued to dig. We’d eliminated ~99% of the packet loss, and the remaining drop rate was far less significant than where we’d started, but we still wanted to understand why drops were occurring at all after adjusting the NIC’s RX ring buffer size.</p>
<p>So what does <code>dropped</code> represent, and what is this <code>lxc0ca0ec41ecd2</code> interface? <code>dropped</code> is similar to <code>missed</code>, but only occurs when packets are deliberately dropped by the kernel or network interface. Crucially though, it doesn’t tell you why a packet was dropped. As for the <code>lxc0ca0ec41ecd2</code> interface, we use the <a href="https://learn.microsoft.com/en-us/azure/aks/azure-cni-powered-by-cilium">Azure CNI Powered by Cilium</a> to provide the network functionality to our AKS clusters. Any pod spun up on an AKS node gets a ‘logical’ interface, which is a virtual ethernet (<code>veth</code>) pair that connects the pod’s network namespace with the host’s network namespace. It was here that we were dropping packets.</p>
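<p>These per-interface counters can also be read directly from <code>sysfs</code>, which is handy for scripting a quick survey across all interfaces; the script below is a small sketch that reads the same kernel statistics that <code>ip -s link show</code> reports:</p>

```shell
#!/bin/sh
# Print RX/TX drop and RX missed counters for every network interface,
# reading the same kernel statistics that `ip -s link show` reports.
for dev in /sys/class/net/*; do
  name=$(basename "$dev")
  printf '%-18s rx_dropped=%s tx_dropped=%s rx_missed=%s\n' \
    "$name" \
    "$(cat "$dev/statistics/rx_dropped")" \
    "$(cat "$dev/statistics/tx_dropped")" \
    "$(cat "$dev/statistics/rx_missed_errors")"
done
```

Running this periodically (or shipping the counters with an observability agent) makes it easy to spot which interface, physical or virtual, is accumulating drops.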
<p><img src="https://www.elastic.co/observability-labs/assets/images/debugging-aks-packet-loss/aks-node-network-topology.png" alt="AKS Node Networking Diagram" /></p>
<p>In our experience, packet drops at this layer are unusual, so we started digging deeper into the cause of the drops. There are numerous ways you can debug why a packet is being dropped, but one of the easiest is <a href="https://perfwiki.github.io/main/">to use <code>perf</code></a> to attach to the <code>skb:kfree_skb</code> tracepoint. The &quot;socket buffer&quot; (<code>skb</code>) is the primary data structure used to represent network packets in the Linux kernel. When a packet is dropped, its corresponding socket buffer is usually freed, triggering the <code>kfree_skb</code> tracepoint. Using <code>perf</code> to attach to this event allowed us to capture stack traces to analyze the cause of the drops.</p>
<pre><code># perf record -g -a -e skb:kfree_skb
</code></pre>
<p>We left this to run for ~10 minutes or so to capture as many drops as possible, and then, ‘heavily inspired’ by <a href="https://gist.github.com/bobrik/0e57671c732d9b13ac49fed85a2b2290">this GitHub Gist by Ivan Babrou</a>, we converted the stack traces into easier-to-read <a href="https://github.com/brendangregg/FlameGraph">Flamegraphs</a>:</p>
<pre><code># perf script | sed -e 's/skb:kfree_skb:.*reason:\(.*\)/\n\tfffff \1 (unknown)/' -e 's/^\(\w\+\)\s\+/kernel /' &gt; stacks.txt
cat stacks.txt | stackcollapse-perf.pl --all | perl -pe 's/.*?;//' | sed -e 's/.*irq_exit_rcu_\[k\];/irq_exit_rcu_[k];/' | flamegraph.pl --colors=java --hash --title=aks-k8s-node-1 --width=1440 --minwidth=0.005 &gt; aks-k8s-node-1.svg
</code></pre>
<div align="center">
<p><img src="/assets/images/debugging-aks-packet-loss/aks-packet-loss-flamegraph.png" alt="AKS Node Packet Loss Flamegraph" /></p>
<p><em>A Flamegraph showing the stack trace ancestry of the packet drops.</em></p>
</div>
<p>The flamegraph here shows how often different functions appeared in the stack traces for packet drops. Each box represents a function call, and wider boxes mean the function appeared more frequently in the traces. Each stack builds upward, from the earliest calls at the bottom to the latest calls at the top.</p>
<p>Firstly, we quickly discovered that, unfortunately, the <code>skb_drop_reason</code> enum <a href="https://github.com/torvalds/linux/commit/c504e5c2f9648a1e5c2be01e8c3f59d394192bd3">was only added in kernel 5.17</a> (Azure’s node image at the time used 5.15). This meant that there was no single human-readable message telling us why the packets were being dropped; all we got was <code>NOT_SPECIFIED</code>. To work out why packets were being dropped, we needed to do a little sleuthing through the stack traces to determine which code paths were being taken when a packet was dropped.</p>
<p>In the flamegraph above you can see that many of the stack traces include <code>veth</code> driver function calls (e.g. <code>veth_xmit</code>), and many end abruptly with a call to the <code>enqueue_to_backlog</code> function. When many stacks end at the same function (like <code>enqueue_to_backlog</code>) it suggests that function is a common point where packets are being dropped. If you go back to the earlier explanation of what happens when a packet arrives at the NIC, you’ll notice that in step 7 we explained:</p>
<blockquote>
<p><em>7. If the networking stack is slower than the rate at which NAPI fetches packets, excess packets are queued in a per-CPU backlog queue (via <code>enqueue_to_backlog</code>). The maximum size of this backlog is controlled by the <code>net.core.netdev_max_backlog</code> sysctl.</em></p>
</blockquote>
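<p>One way to check whether this per-CPU backlog is overflowing is <code>/proc/net/softnet_stat</code>: each row is one CPU, and the second column is a hexadecimal count of packets dropped because the backlog queue was already full. A small script, sketched below, sums it across CPUs:</p>

```shell
#!/bin/bash
# Sum the per-CPU backlog drop counters from /proc/net/softnet_stat.
# Column 1 is packets processed; column 2 counts packets dropped because
# the per-CPU backlog queue (bounded by net.core.netdev_max_backlog)
# was already full. All values are hexadecimal.
total=0
while read -r _ dropped _; do
  total=$(( total + 16#${dropped} ))
done < /proc/net/softnet_stat
echo "backlog drops across all CPUs: ${total}"
```

A non-zero, steadily growing total here is a strong hint that <code>netdev_max_backlog</code> is too small for the traffic pattern.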
<p>Using the same privileged DaemonSet method as for the RX ring buffer adjustment, we raised the <code>net.core.netdev_max_backlog</code> kernel parameter from its default of 1000 to 32768:</p>
<pre><code>/usr/sbin/sysctl -w net.core.netdev_max_backlog=32768
</code></pre>
<p>This value was based on the fact that we knew the hosts were using a 100 Gb/s SR-IOV NIC, even if each VM was allowed only a fraction of the total bandwidth. We acknowledge that it’s worth revisiting this value in the future to see whether it can be tuned further to avoid wasting memory, but at the time, “perfect was the enemy of good”.</p>
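<p>Note that <code>sysctl -w</code> is transient and lost on reboot; on our AKS nodes the DaemonSet re-applies it, but on a plain Linux host the usual way to persist such a setting is a drop-in file (the filename below is illustrative):</p>

```
# /etc/sysctl.d/90-net-backlog.conf
net.core.netdev_max_backlog = 32768
```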
<p>We re-ran the load tests and compared the three sets of results we’d collected thus far.</p>
<div align="center">
<p><img src="/assets/images/debugging-aks-packet-loss/indexing-rate-final.png" alt="Final Indexing Rate Results" /></p>
<p><em>A Kibana visualisation of Rally results, comparing impact to median throughput after each configuration change.</em></p>
</div>
<table>
<thead>
<tr>
<th>Tuning Step</th>
<th>Packet Loss</th>
<th>Median indexing throughput</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td>High</td>
<td>~18,000 docs/s</td>
</tr>
<tr>
<td>+RX Buffer</td>
<td>~99% drop ↓</td>
<td>~26,000 (+ ~40% from baseline)</td>
</tr>
<tr>
<td>+Backlog &amp; +RX Buffer</td>
<td>Near zero</td>
<td>~29,000 (+ ~60% from baseline)</td>
</tr>
</tbody>
</table>
<p>Here you can see the P50 throughput in docs/s over the course of the hours-long load tests. Compared to the baseline, we saw a roughly <strong>~40%</strong> increase in throughput from adjusting the RX ring buffer values alone, and a <strong>~60%</strong> increase with both the RX ring buffer and backlog changes! Hooray!</p>
<p>A great result and one more step on our journey towards better Serverless Elasticsearch performance.</p>
<h2>Working with Azure</h2>
<p>It’s great that we were able to quickly identify and mitigate the majority of our packet loss issues, but since we were using AKS with AKS node images, it made sense to engage with Azure to understand why the defaults weren’t working for our workload.</p>
<p>We walked Azure through our investigation, mitigations, and results, and asked them to validate our mitigations. Azure Engineering confirmed that the host NICs were not discarding packets, meaning everything arriving at the host level was passed through to the hypervisor. Further investigation confirmed that no loss or discards were occurring in the Azure network fabric or inside the hypervisor – which shifted focus to the guest OS and why the guest kernel was slow to read packets off the <code>enP*</code> SR-IOV interfaces.</p>
<p>Given the complexity of our load testing scenario, which involved configuring multiple systems and tools (including <a href="https://www.elastic.co/observability">Elastic Observability</a>), we also developed a simplified reproduction of the packet loss issue using <a href="https://github.com/esnet/iperf"><code>iperf3</code></a>. This simplified test was created specifically to share with Azure for targeted analysis, complementing the broader monitoring and analysis enabled by Elastic Observability and Rally.</p>
<p>With this reproduction, Azure was able to confirm the increasing <code>missed</code> and <code>dropped</code> packet counters we had observed, and endorsed the larger RX ring buffer and <code>netdev_max_backlog</code> values as the recommended mitigations.</p>
<h2>Conclusion</h2>
<p>While cloud providers offer various abstractions to manage your resources, the underlying hardware ultimately determines your application's performance and stability. High-performance hardware often requires tuning at the operating system level, well beyond the default settings most environments ship with. In managed platforms like AKS, where Azure controls both the node images and infrastructure, it is easy to overlook the impact of low-level configurations such as network device ring buffer sizes or sysctls like <code>net.core.netdev_max_backlog</code>.</p>
<p>Our experience shows that even with the convenience of a managed Kubernetes service, performance issues can still emerge if these hardware parameters are not tuned appropriately. It was tempting to assume that high-speed 100 Gb/s network interfaces, directly attached to the VM using SR-IOV, would eliminate any chance of network-related bottlenecks. In reality, that assumption didn’t hold up.</p>
<p>Engaging early with Azure was essential, as they provided deeper visibility into the underlying infrastructure and worked with us to tune low-level, performance-critical settings. Combined with thorough load and scale testing and robust observability using tools like Elastic Observability, this collaboration helped us detect and rectify the issue early in order to deliver a consistent, reliable, and high-performing experience for our users.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/debugging-aks-packet-loss/debugging-aks-packet-loss.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Using Elastic to observe GKE Autopilot clusters]]></title>
            <link>https://www.elastic.co/observability-labs/blog/observe-gke-autopilot-clusters</link>
            <guid isPermaLink="false">observe-gke-autopilot-clusters</guid>
            <pubDate>Wed, 15 Mar 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[See how deploying the Elastic Agent onto a GKE Autopilot cluster makes observing the cluster’s behavior easy. Kibana integrations make visualizing the behavior a simple addition to your observability dashboards.]]></description>
            <content:encoded><![CDATA[<p>Elastic has formally supported Google Kubernetes Engine (GKE) since January 2020, when Elastic Cloud on Kubernetes was announced. Since then, Google has expanded GKE, with new service offerings and delivery mechanisms. One of those new offerings is GKE Autopilot. Where GKE is a managed Kubernetes environment, GKE Autopilot is a mode of Kubernetes operation where Google manages your cluster configuration, scaling, security, and more. It is production ready and removes many of the challenges associated with tasks like workload management, deployment automation, and scalability rules. Autopilot lets you focus on building and deploying your application while Google manages everything else.</p>
<p>Elastic is committed to supporting Google Kubernetes Engine (GKE) in all of its delivery modes. In October, during the Google Cloud Next ‘22 event, we announced our intention to integrate and certify Elastic Agent on Anthos, Autopilot, Google Distributed Cloud, and more.</p>
<p>Since that event, we have worked together with Google to get the Elastic Agent certified for use on Anthos, but we didn’t stop there.</p>
<p>Today we are happy to <a href="https://github.com/elastic/elastic-agent/blob/autopilotdocumentaton/docs/elastic-agent-gke-autopilot.md">announce</a> that we have been certified for operation on GKE Autopilot.</p>
<h2>Hands on with Elastic and GKE Autopilot</h2>
<h3><a href="https://www.elastic.co/observability/kubernetes-monitoring">Kubernetes observability</a> has never been easier</h3>
<p>To show how easy it is to get started with Autopilot and Elastic, let's walk through deploying the Elastic Agent on an Autopilot cluster, setting it up to monitor the cluster, and observing the cluster’s behavior with Kibana integrations.</p>
<p>One of the main differences between GKE and GKE Autopilot is that Autopilot protects the system namespace “kube-system.” To increase the stability and security of a cluster, Autopilot prevents user space workloads from adding or modifying system pods. The default configuration for Elastic Agent is to install itself into the system namespace. The majority of the changes we will make here are to convince the Elastic Agent to run in a different namespace.</p>
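<p>Concretely, the tweak amounts to pointing the Agent's manifest objects at a different namespace, along the lines of the following sketch (the namespace name here is an arbitrary illustrative choice, not the exact manifest):</p>

```yaml
# Illustrative excerpt: run the Elastic Agent DaemonSet (and its
# ServiceAccount/RBAC objects) in a dedicated namespace instead of
# the Autopilot-protected kube-system namespace.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-agent
  namespace: elastic-agent   # anything other than kube-system
```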
<h2>Let’s get started with Elastic Stack!</h2>
<p>While writing this article, I used the latest version of Elastic. The best way for you to get started with Elastic Observability is to:</p>
<ol>
<li>Get an account on <a href="https://cloud.elastic.co/registration?fromURI=/home">Elastic Cloud</a> and look at this <a href="https://www.elastic.co/videos/training-how-to-series-cloud">tutorial</a> to help launch your first stack, or</li>
<li><a href="https://www.elastic.co/partners/google-cloud">Launch Elastic Cloud on your Google Account</a></li>
</ol>
<h2>Provisioning an Autopilot cluster and an Elastic stack</h2>
<p>To test the agent, I first deployed the recommended, default GKE Autopilot cluster. Elastic’s GKE integration supports kube-state-metrics (KSM), which will increase the number of reported metrics available for reporting and dashboards. Like the Elastic Agent, KSM defaults to running in the system namespace, so I modified its manifest to work with Autopilot. For my testing, I also deployed a basic Elastic stack on Elastic Cloud in the same Google region as my Autopilot cluster. I used a fresh cluster deployed on Elastic’s managed service (ESS), but the process is the same if you are using an Elastic Cloud subscription purchased through the Google marketplace.</p>
<h2>Adding Elastic Observability to GKE Autopilot</h2>
<p>Because this is a brand new deployment, Elastic suggests adding integrations to it. Let’s add the Kubernetes integration into the new deployment:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-welcome-to-elastic.png" alt="elastic agent GKE autopilot welcome" /></p>
<p>Elastic offers hundreds of integrations; filter the list by typing “kub” into the search bar (1) and then click the Kubernetes integration (2).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-kubernetes-integration.png" alt="elastic agent GKE autopilot kubernetes integration" /></p>
<p>The Kubernetes integration page gives you an overview of the integration and lets you manage the Kubernetes clusters you want to observe. We haven’t added a cluster yet, so I clicked “Add Kubernetes” to add the first integration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-add-kubernetes.png" alt="elastic agent GKE autopilot add kubernetes" /></p>
<p>I changed the integration name to reflect the Kubernetes offering type and then clicked “Save and continue” to accept the integration defaults.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-add-kubernetes-integration.png" alt="elastic agent GKE autopilot add kubernetes integration" /></p>
<p>At this point, an Agent policy has been created. Now it’s time to install the agent. I clicked on the “Kubernetes” integration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-agent-policy-1.png" alt="elastic agent GKE autopilot agent policy" /></p>
<p>Then I selected the “integration policies” tab (1) and clicked “Add agent” (2).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-add-agent.png" alt="elastic agent GKE autopilot add agent" /></p>
<p>Finally, I downloaded the full manifest for a standard GKE environment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-download-manifest.png" alt="elastic agent GKE autopilot download manifest" /></p>
<p>We won’t be using this manifest directly, but it contains many of the values that we will need to deploy the agent on Autopilot in the next section.</p>
<p>The Elastic stack is ready and waiting for the Autopilot logs, metrics, and events. It’s time to connect Autopilot to this deployment using the Elastic Agent for GKE.</p>
<h2>Connect Autopilot to Elastic</h2>
<p>From the Google cloud terminal, I downloaded and edited the Elastic Agent manifest for GKE Autopilot.</p>
<pre><code class="language-bash"># Use the raw URL so curl downloads the manifest itself, not the GitHub HTML page
$ curl -L -o elastic-agent-managed-gke-autopilot.yaml \
https://raw.githubusercontent.com/elastic/elastic-agent/autopilotdocumentaton/docs/manifests/elastic-agent-managed-gke-autopilot.yaml
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-cloud-shell-editor.png" alt="elastic agent GKE autopilot cloud shell editor" /></p>
<p>I used the cloud shell editor to configure the manifest for my Autopilot and Elastic clusters. For example, I updated the following:</p>
<pre><code class="language-yaml">containers:
  - name: elastic-agent
    image: docker.elastic.co/beats/elastic-agent:8.6.0
</code></pre>
<p>Here I changed the agent image tag to match the version of the Elastic stack I had installed (8.6.0).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-google-cloud.png" alt="elastic agent GKE autopilot google cloud" /></p>
<p>From the Integration manifest I downloaded earlier, I copied the values for FLEET_URL and FLEET_ENROLLMENT_TOKEN into this YAML file.</p>
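<p>In the manifest, those two values end up as environment variables on the agent container, roughly like this sketch (placeholders shown instead of real values):</p>

```yaml
env:
  - name: FLEET_URL
    value: "https://<your-fleet-server-host>:443"
  - name: FLEET_ENROLLMENT_TOKEN
    value: "<your-enrollment-token>"
```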
<p>Now it’s time to apply the updated manifest to the Autopilot instance.</p>
<p>Before I commit, I always like to see what’s going to be created (and check for syntax errors) with a dry run.</p>
<pre><code class="language-bash">$ clear
$ kubectl apply --dry-run=&quot;client&quot; -f elastic-agent-managed-gke-autopilot.yaml
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-dry-run.png" alt="elastic agent GKE autopilot dry run" /></p>
<p>Everything looks good, so I’ll do it for real this time.</p>
<pre><code class="language-bash">$ clear
$ kubectl apply -f elastic-agent-managed-gke-autopilot.yaml
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-autopilot-cluster.png" alt="elastic agent GKE autopilot cluster" /></p>
<p>After several minutes, metrics will start flowing from the Autopilot cluster directly into the Elastic deployment.</p>
<h2>Adding a workload to the Autopilot cluster</h2>
<p>Observing an Autopilot cluster without a workload is boring, so I deployed a modified version of Google’s <a href="https://github.com/bshetti/opentelemetry-microservices-demo">Hipster Shop</a> (which includes OpenTelemetry reporting):</p>
<pre><code class="language-bash">$ git clone https://github.com/bshetti/opentelemetry-microservices-demo
$ cd opentelemetry-microservices-demo
$ nano ./deploy-with-collector-k8s/otelcollector.yaml
</code></pre>
<p>To get the application’s telemetry talking to our Elastic stack, I changed every exporter from the HTTP type (otlphttp/elastic) to the gRPC type (otlp/elastic). I then replaced OTEL_EXPORTER_OTLP_ENDPOINT with my APM endpoint and replaced OTEL_EXPORTER_OTLP_HEADERS with my APM bearer token.</p>
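<p>The resulting exporter section of the collector configuration looks roughly like this sketch (the endpoint and header values are placeholders, not real credentials):</p>

```yaml
exporters:
  otlp/elastic:                    # gRPC exporter, replacing otlphttp/elastic
    endpoint: "<your-apm-endpoint>:443"
    headers:
      Authorization: "Bearer <your-apm-secret-token>"
```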
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-terminal-telemetry.png" alt="elastic agent GKE autopilot terminal telemetry" /></p>
<p>Then I deployed the Hipster Shop.</p>
<pre><code class="language-bash">$ kubectl create -f ./deploy-with-collector-k8s/adservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/redis.yaml
$ kubectl create -f ./deploy-with-collector-k8s/cartservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/checkoutservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/currencyservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/emailservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/frontend.yaml
$ kubectl create -f ./deploy-with-collector-k8s/paymentservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/productcatalogservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/recommendationservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/shippingservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/loadgenerator.yaml
</code></pre>
<p>Once all of the shop’s pods were running, I deployed the OpenTelemetry collector.</p>
<pre><code class="language-bash">$ kubectl create -f ./deploy-with-collector-k8s/otelcollector.yaml
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-deployed-opentelemetry-collector.png" alt="elastic agent GKE autopilot deployed opentelemetry collector" /></p>
<h2>Observe and visualize Autopilot’s metrics</h2>
<p>Now that we have added the Elastic Agent to our Autopilot cluster and added a workload, let's take a look at some of the Kubernetes visualizations the integration provides out of the box.</p>
<p>The “[Metrics Kubernetes] Overview” is a great place to start. It provides a high-level view of the resources used by the cluster and allows me to drill into more specific dashboards that I find interesting:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-create-visualization.png" alt="elastic agent GKE autopilot create visualization" /></p>
<p>For example, the “[Metrics Kubernetes] Pods” gives me a high-level view of the pods deployed in the cluster:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-pod.png" alt="elastic agent GKE autopilot pod" /></p>
<p>The “[Metrics Kubernetes] Volumes” gives me an in-depth view to how storage is allocated and used in the Autopilot cluster:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-filesystem-information.png" alt="elastic agent GKE autopilot filesystem information" /></p>
<h2>Creating an alert</h2>
<p>From here, I can easily discover patterns in my cluster’s behavior and even create Alerts. Here is an example of an alert to notify me if the main storage volume (called “volume”) exceeds 80% of its allocated space:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-create-rule-elasticsearch-query.png" alt="elastic agent GKE autopilot create rule" /></p>
<p>With a little work, I created this view from the standard dashboard:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-kubernetes-dashboard.png" alt="elastic agent GKE autopilot kubernetes dashboard" /></p>
<h2>Conclusion</h2>
<p>Today I have shown how easy it is to monitor, observe, and generate alerts on a GKE Autopilot cluster. To get more information on what is possible, see the official Elastic documentation for <a href="https://github.com/elastic/elastic-agent/blob/autopilotdocumentaton/docs/elastic-agent-gke-autopilot.md">Autopilot observability with Elastic Agent</a>.</p>
<h2>Next steps</h2>
<p>If you don’t have Elastic yet, you can get started for free with an <a href="https://www.elastic.co/cloud/elasticsearch-service/signup">Elastic Trial</a> today. Get more from Elastic and Google together with a <a href="https://console.cloud.google.com/marketplace/browse?q=Elastic&amp;utm_source=Elastic&amp;utm_medium=qwiklabs&amp;utm_campaign=Qwiklabs+to+Marketplace">Marketplace subscription</a>. Elastic does more than just integrate with GKE — check out the almost <a href="https://www.elastic.co/integrations">300 integrations</a> that Elastic provides.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-kubernetes-dashboard.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Unlock possibilities with native OpenTelemetry: prioritize reliability, not proprietary limitations]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-native-kubernetes-observability</link>
            <guid isPermaLink="false">elastic-opentelemetry-native-kubernetes-observability</guid>
            <pubDate>Tue, 12 Nov 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic now supports Elastic Distributions of OpenTelemetry (EDOT) deployment and management on Kubernetes, using OTel Operator. SREs can now access out-of the-box configurations and dashboards designed to streamline collector deployment, application auto-instrumentation and lifecycle management with Elastic Observability.]]></description>
            <content:encoded><![CDATA[<p>OpenTelemetry (OTel) is emerging as the standard for data ingestion since it delivers a vendor-agnostic way to ingest data across all telemetry signals. Elastic Observability is leading the OTel evolution with the following announcements:</p>
<ul>
<li>
<p><strong>Native OTel Integrity:</strong> Elastic is now 100% OTel-native, retaining OTel data natively without requiring data translation. This eliminates the need for SREs to handle tedious schema conversions and develop customized views. All Elastic Observability capabilities—such as entity discovery, entity-centric insights, APM, infrastructure monitoring, and AI-driven issue analysis—now work seamlessly with native OTel data.</p>
</li>
<li>
<p><strong>Powerful end-to-end OTel-based Kubernetes observability with</strong> <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry"><strong>Elastic Distributions of OpenTelemetry (EDOT)</strong></a><strong>:</strong> Elastic now supports EDOT deployment and management on Kubernetes via the OTel Operator, enabling streamlined EDOT collector deployment, application auto-instrumentation, and lifecycle management. With out-of-the-box OTel-based Kubernetes integration and dashboards, SREs gain instant, real-time visibility into cluster and application metrics, logs, and traces—with no manual configuration needed.</p>
</li>
</ul>
<p>For organizations, it signals our commitment to open standards, streamlined data collection, and delivering insights from native OpenTelemetry data. Bring the power of Elastic Observability to your Kubernetes and OpenTelemetry deployments for maximum visibility and performance. </p>
<h1>Fully native OTel architecture with in-depth data analysis</h1>
<p>Elastic’s OpenTelemetry-first architecture is 100% OTel-native, fully retaining the OTel data model, including OTel Semantic Conventions and Resource attributes, so your observability data remains in OpenTelemetry standards. OTel data in Elastic is also backward compatible with the Elastic Common Schema (ECS).</p>
<p>SREs now gain a holistic view of resources, as Elastic accurately identifies entities through OTel resource attributes. For example, in a Kubernetes environment, Elastic identifies containers, hosts, and services and connects these entities to logs, metrics, and traces.</p>
<p>Once OTel data is in Elastic’s scalable vector datastore, Elastic’s capabilities such as the AI Assistant, zero-config machine learning-based anomaly detection, pattern analysis, and latency correlation empower SREs to quickly analyze and pinpoint potential issues in production environments.</p>
<h1>Kubernetes insights with Elastic Distributions of OpenTelemetry (EDOT)</h1>
<p>EDOT reduces manual effort through automated onboarding and pre-configured dashboards. With EDOT and OpenTelemetry, Elastic makes Kubernetes monitoring straightforward and accessible for organizations of any size.</p>
<p>EDOT, paired with Elasticsearch, enables storage for all signal types—logs, metrics, traces, and soon profiling—while maintaining essential resource attributes and semantic conventions.</p>
<p>Elastic’s OpenTelemetry-native solution enables customers to quickly extract insights from their data rather than manage complex infrastructure to ingest data. Elastic automates the deployment and configuration of observability components to deliver a user experience focused on ease and scalability, making it well-suited for large-scale environments and diverse industry needs.</p>
<p>Let’s take a look at how Elastic’s EDOT enables visibility into Kubernetes environments.</p>
<h2>1. Simple 3-step OTel ingest with lifecycle management and auto-instrumentation </h2>
<p>Elastic leverages the upstream OpenTelemetry Operator to automate its EDOT lifecycle management—including deployment, scaling, and updates—allowing customers to focus on visibility into their Kubernetes infrastructure and applications instead of their observability infrastructure for data collection.</p>
<p>The Operator integrates with the EDOT Collector and language SDKs to provide a consistent, vendor-agnostic experience. For instance, when customers deploy a new application, they don’t need to manually configure instrumentation for various languages; the OpenTelemetry Operator manages this through auto-instrumentation, as supported by the upstream OpenTelemetry project.</p>
<p>This integration simplifies observability by ensuring consistent application instrumentation across the Kubernetes environment. Elastic’s collaboration with the upstream OpenTelemetry project strengthens this automation, enabling users to benefit from the latest updates and improvements in the OpenTelemetry ecosystem. By relying on open source tools like the OpenTelemetry Operator, Elastic ensures that its solutions stay aligned with the latest advancements in the OpenTelemetry project, reinforcing its commitment to open standards and community-driven development.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-native-kubernetes-observability/unified-otel-based-k8s-experience.png" alt="Unified OTel-based Kubernetes Experience" /></p>
<p>The diagram above shows how the operator can deploy multiple OTel collectors, helping SREs deploy individual EDOT Collectors for specific applications and infrastructure. This configuration improves availability for OTel ingest and the telemetry is sent directly to Elasticsearch servers via OTLP.</p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator">Check out our recent blog on how to set this up</a>.</p>
<h2>2. Out-of-the-box OTel-based Kubernetes integration with dashboards</h2>
<p>Elastic delivers an OTel-based Kubernetes configuration for the OTel collector by packaging all necessary receivers, processors, and configurations for Kubernetes observability. This enables users to automatically collect, process, and analyze Kubernetes metrics, logs, and traces without the need to configure each component individually.</p>
<p>The OpenTelemetry Kubernetes Collector components provide essential building blocks, including receivers like the Kubernetes Cluster Receiver for cluster metrics and the Kubeletstats Receiver for detailed node and container metrics, along with processors for data transformation and enrichment. By packaging these components, Elastic offers a turnkey solution that simplifies Kubernetes observability and eliminates the need for users to set up and configure individual collectors or processors.</p>
<p>This pre-packaged approach, which includes <a href="https://github.com/elastic/integrations/tree/main/packages/kubernetes_otel">OTel-native Kibana assets</a> such as dashboards, allows users to focus on analyzing their observability data rather than managing configuration details. Elastic’s Unified OpenTelemetry Experience ensures that users can harness OpenTelemetry’s full potential without needing deep expertise. Whether you’re monitoring resource usage, container health, or API server metrics, users gain comprehensive observability through EDOT.</p>
<p>For more details on OpenTelemetry Kubernetes Collector components, visit <a href="https://opentelemetry.io/docs/kubernetes/collector/components/">OpenTelemetry Collector Components</a>.</p>
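<p>To make the packaging concrete, here is a minimal, illustrative collector configuration wiring those receivers and processors together. This is a sketch only, not the EDOT-shipped configuration: the file name is arbitrary, and the endpoint and credential variables are placeholders.</p>
<pre><code class="language-bash"># Sketch only: a minimal collector config combining the Kubernetes
# receivers described above. Endpoints and credentials are placeholders.
cat &lt;&lt;'EOF' &gt; otel-k8s-sketch.yaml
receivers:
  k8s_cluster:              # cluster-level metrics (pods, deployments, node conditions)
    collection_interval: 30s
  kubeletstats:             # per-node, pod, and container resource metrics
    auth_type: serviceAccount
    endpoint: https://${env:K8S_NODE_NAME}:10250
processors:
  k8sattributes: {}         # enrich telemetry with pod and namespace metadata
  batch: {}
exporters:
  elasticsearch:
    endpoint: ${env:ELASTIC_ENDPOINT}
    api_key: ${env:ELASTIC_API_KEY}
service:
  pipelines:
    metrics:
      receivers: [k8s_cluster, kubeletstats]
      processors: [k8sattributes, batch]
      exporters: [elasticsearch]
EOF
</code></pre>
<p>The actual configuration EDOT uses is packaged in Elastic's Helm values, so you normally never write this file by hand.</p>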
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-native-kubernetes-observability/otel-based-k8s-dashboard.png" alt="OTel-based Kubernetes Dashboard" /></p>
<h2>3. Streamlined ingest architecture with OTel data and Elasticsearch</h2>
<p>Elastic’s ingest architecture minimizes infrastructure overhead by enabling users to forward trace data directly into Elasticsearch with the EDOT Collector, removing the need for the Elastic APM server. This approach:</p>
<ul>
<li>
<p>Reduces the costs and complexity associated with maintaining additional infrastructure, allowing users to deploy, scale, and manage their observability solutions with fewer resources.</p>
</li>
<li>
<p>Allows all OTel data (metrics, logs, and traces) to be ingested and stored in Elastic’s single vector database store, enabling further analysis with Elastic’s AI-driven capabilities.</p>
</li>
</ul>
<p>SREs can now reduce operational burdens while gaining the high-performance analytics and observability insights provided by Elastic.</p>
<h1>Elastic’s ongoing commitment to open source and OpenTelemetry</h1>
<p>With <a href="https://www.elastic.co/blog/elasticsearch-is-open-source-again">Elasticsearch fully open source once again</a> under the AGPL license, Elastic reinforces its deep commitment to open standards and the open source community. This aligns with Elastic’s OpenTelemetry-first approach to observability, where Elastic Distributions of OpenTelemetry (EDOT) streamline OTel ingestion and schema auto-detection, providing real-time insights for Kubernetes and application telemetry.</p>
<p>As users increasingly adopt OTel as their schema and data collection architecture for observability, Elastic’s Distribution of OpenTelemetry (EDOT), currently in tech preview, enhances standard OpenTelemetry capabilities and improves troubleshooting while also serving as a commercially supported OTel distribution. EDOT, together with Elastic’s recent contributions of the Elastic Profiling Agent and Elastic Common Schema (ECS) to OpenTelemetry, reinforces Elastic’s commitment to establishing OpenTelemetry as the industry standard.</p>
<p>Customers can now embrace open standards and enjoy the advantages of an open, extensible platform that integrates seamlessly with their environment. The end result? Reduced costs, greater visibility, and vendor independence.</p>
<h1>Getting hands-on with Elastic Observability and EDOT</h1>
<p>Ready to try out the OTel Operator with the EDOT Collector and SDKs to see how Elastic utilizes ingested OTel data in APM, Discover, Analysis, and out-of-the-box dashboards?</p>
<ul>
<li>
<p><a href="https://cloud.elastic.co/">Get an account on Elastic Cloud</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Learn about Elastic Distributions of OpenTelemetry Overview</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry">Utilize the OpenTelemetry Demo with EDOT</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/infrastructure-monitoring-with-opentelemetry-in-elastic-observability">Understand how you can monitor Kubernetes with EDOT</a></p>
</li>
<li>
<p><a href="https://github.com/elastic/opentelemetry">Utilize the EDOT Operator</a> and the <a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">EDOT OTel collector</a></p>
</li>
</ul>
<p>If you have your own application and want to configure it with EDOT auto-instrumentation, read the following blogs on Go, Java, PHP, and Python:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-go-applications-opentelemetry">Auto-Instrumenting Go Applications with OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution OpenTelemetry Java Agent</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-php">Elastic OpenTelemetry Distribution for PHP</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python">Elastic OpenTelemetry Distribution for Python</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-native-kubernetes-observability/Kubecon-main-blog.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Native OTel-based K8s & App Observability in 3 Steps with Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator</link>
            <guid isPermaLink="false">elastic-opentelemetry-otel-operator</guid>
            <pubDate>Wed, 13 Nov 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic's Distributions of OpenTelemetry are now supported with the OTel Operator, providing auto instrumentation of applications with EDOT SDKs, and deployment and lifecycle management of the EDOT OTel Collector for Kubernetes Observability. Learn how to configure this in 3 easy steps]]></description>
            <content:encoded><![CDATA[<p>Elastic recently released its Elastic Distributions of OpenTelemetry (EDOT), developed to enhance the capabilities of standard OpenTelemetry distributions and improve Elastic’s existing OpenTelemetry support. EDOT helps Elastic deliver its new Unified OpenTelemetry Experience. SREs are no longer burdened with a set of tedious steps for instrumenting applications and ingesting OTel data into Observability; they get a simple, frictionless way to deploy the OTel Collector, instrument applications, and ingest all the OTel data into Elastic. The components of this experience (detailed in the overview blog) include:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Elastic Distributions for OpenTelemetry (EDOT)</a></p>
</li>
<li>
<p>Elastic’s configuration for the OpenTelemetry Operator providing:</p>
<ul>
<li>
<p>OTel Lifecycle management for the OTel collector and SDKs</p>
</li>
<li>
<p>Auto-instrumentation of apps that developers have not manually instrumented</p>
</li>
</ul>
</li>
<li>
<p>Pre-packaged receivers, processors, exporters, and configuration for the OTel Kubernetes Collector</p>
</li>
<li>
<p>Out-of-the-box OTel-based K8S dashboards for metrics and logs</p>
</li>
<li>
<p>Discovered inventory views for services, hosts, and containers</p>
</li>
<li>
<p>Direct OTel ingest into Elasticsearch for EDOT (bypassing ingest into APM server) - all your data (logs, metrics, and traces) is now stored in Elastic’s Search AI Lake</p>
</li>
<li>
<p>All ingested OTel data is used and displayed natively in Discover, APM, Inventory, etc.</p>
</li>
</ul>
<p>In this blog we will cover how to ingest OTel for K8S and your application in 3 easy steps:</p>
<ol>
<li>
<p>Copy the install commands from the UI</p>
</li>
<li>
<p>Add the OpenTelemetry Helm charts, install the OpenTelemetry Operator with Elastic’s Helm configuration, and set your Elastic endpoint and authentication</p>
</li>
<li>
<p>Annotate the app services you want to be auto-instrumented </p>
</li>
</ol>
<p>Then you can easily see Kubernetes metrics and logs, as well as application logs, metrics, and traces, in Elastic Observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/unified-otel-based-k8s-experience.png" alt="OpenTelemetry Unified Observability Experience" /></p>
<p>To follow this blog you will need to have:</p>
<ol>
<li>
<p>An account on cloud.elastic.co, with access to get the Elasticsearch endpoint and authentication (api key)</p>
</li>
<li>
<p>A non-instrumented application with services written in Go, .NET, Python, or Java, to be auto-instrumented through the OTel Operator. In this example, we will be using the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">Elastiflix</a> application.</p>
</li>
<li>
<p>A Kubernetes cluster, we used EKS in our setup</p>
</li>
<li>
<p>Helm and Kubectl loaded</p>
</li>
</ol>
<p>You can find the authentication details (API key) in the Integrations section of Elastic. More information is also available in the <a href="https://www.elastic.co/guide/en/kibana/current/api-keys.html">documentation</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-api-keys.png" alt="OpenTelemetry API Keys" /></p>
<h2>K8S and Application Observability in Elastic:</h2>
<p>Before we walk you through the steps, let's show you what is visible in Elastic.</p>
<p>Once the Operator starts the OTel Collector, you can see the following in Elastic:</p>
<h3>Kubernetes metrics:</h3>
<p>Using an out-of-the-box dashboard, you can see node metrics, overall cluster metrics, and status across pods, deployments, etc.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-k8s-dashboard.png" alt="OTel-based Kubernetes dashboard" /></p>
<h3>Discovered Inventory for Hosts, services, and containers:</h3>
<p>This can be found at Observability-&gt;Inventory on the UI</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-k8s-inventory.png" alt="OTel-based Kubernetes inventory" /></p>
<h3>Detailed metrics, logs, and processor info on hosts:</h3>
<p>This can be found at Observability-&gt;Infrastructure-&gt;Hosts</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-k8s-hosts.png" alt="OTel-based Kubernetes host metrics" /></p>
<h3>K8S and application logs in Elastic’s New Discover (called Explorer)</h3>
<p>This can be found on Observability-&gt;Discover</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-ingest-logs.png" alt="OTel-based Kubernetes logs" /></p>
<h3>Application Service views (logs, metrics, and traces):</h3>
<p>This can be found on Observability-&gt;Application</p>
<p>Then select the service and drill down into different aspects.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-java-traces.png" alt="OTel-based Application Java traces" /></p>
<p>Above, you can see how traces are displayed using native OTel data.</p>
<h2>Steps to install</h2>
<h3>Step 0. Follow the commands listed in the UI</h3>
<p>Under Add data-&gt;Kubernetes-&gt;Kubernetes Monitoring with EDOT</p>
<p>You will find the following instructions, which we will follow here.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-edot-operator-install.png" alt="EDOT Operator Install" /></p>
<h3>Step 1. Install the EDOT config for the OpenTelemetry Operator</h3>
<p>Run the following commands. Make sure you are already authenticated against your Kubernetes cluster, as that is where you will run the helm commands provided below.</p>
<pre><code class="language-bash"># Install helm repo needed
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts --force-update
# Install needed secrets. Provide the Elasticsearch Endpoint URL and API key you have noted in previous steps
kubectl create ns opentelemetry-operator-system
kubectl create -n opentelemetry-operator-system secret generic elastic-secret-otel \
    --from-literal=elastic_endpoint='YOUR_ELASTICSEARCH_ENDPOINT' \
    --from-literal=elastic_api_key='YOUR_ELASTICSEARCH_API_KEY'
# Install the EDOT Operator
helm install opentelemetry-kube-stack open-telemetry/opentelemetry-kube-stack --namespace opentelemetry-operator-system --create-namespace --values https://raw.githubusercontent.com/elastic/opentelemetry/refs/heads/main/resources/kubernetes/operator/helm/values.yaml --version 0.3.0
</code></pre>
<p>The values.yaml file configuration can be found <a href="https://github.com/elastic/opentelemetry/blob/main/resources/kubernetes/operator/helm/values.yaml">here</a>.</p>
<h3>Step 1b: Ensure OTel data is arriving in Elastic</h3>
<p>The simplest way to check is to go to Menu &gt; Dashboards &gt; <strong>[OTEL][Metrics Kubernetes] Cluster Overview</strong>, and ensure you see the following dashboard being populated.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-k8s-dashboard.png" alt="OTel-based Kubernetes dashboard" /></p>
<h3>Step 2: Annotate the application with auto-instrumentation</h3>
<p>For this example, we’re only going to annotate one service: the favorite-java service in the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">Elastiflix</a> application.</p>
<p>Use the following commands to initiate auto-instrumentation:</p>
<pre><code class="language-bash">#Annotate Java namespace
kubectl annotate namespace java instrumentation.opentelemetry.io/inject-java=&quot;opentelemetry-operator-system/elastic-instrumentation&quot;
#Restart the java-app to get the new annotation
kubectl rollout restart deployment java-app -n java
</code></pre>
<p>You can also add the annotation directly to your pod’s YAML:</p>
<pre><code class="language-yaml">metadata:
 name: my-app
 annotations:
   instrumentation.opentelemetry.io/inject-python: &quot;true&quot;
</code></pre>
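<p>One detail worth noting: for a Deployment, the annotation needs to sit on the pod template metadata (not the Deployment’s top-level metadata), because the operator’s webhook mutates pods at admission time. A hypothetical manifest, with placeholder names and image, would look like this:</p>
<pre><code class="language-bash"># Sketch: a Deployment whose pod template carries the inject annotation.
# All names and the image are hypothetical placeholders.
cat &lt;&lt;'EOF' &gt; my-app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        # the operator webhook injects the EDOT Python SDK into matching pods
        instrumentation.opentelemetry.io/inject-python: &quot;true&quot;
    spec:
      containers:
        - name: my-app
          image: example.registry/my-app:latest
EOF
</code></pre>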
<p>These instructions are provided in the UI:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-edot-sdk-annotate.png" alt="Annotate Application with EDOT SDK" /></p>
<h2>Check out the service data in Elastic APM</h2>
<p>Once the OTel data is in Elastic, you can see:</p>
<ul>
<li>
<p>Out-of-the-box dashboards for OTel-based Kubernetes metrics</p>
</li>
<li>
<p>Discovered resources such as services, hosts, and containers that are part of the Kubernetes clusters</p>
</li>
<li>
<p>Kubernetes metrics, host metrics, logs, processor info, anomaly detection, and universal profiling.</p>
</li>
<li>
<p>Log analytics in Elastic Discover</p>
</li>
<li>
<p>APM features that show app overview, transactions, dependencies, errors, and more:</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-java-service.png" alt="Java service in Elastic APM" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-java-traces.png" alt="OTel-based Application Java traces" /></p>
<h2>Try it out</h2>
<p>Elastic’s Distribution of OpenTelemetry (EDOT) transforms the observability experience by streamlining Kubernetes and application instrumentation. With EDOT, SREs and developers can bypass complex setups, instantly gain deep visibility into Kubernetes clusters, and capture critical metrics, logs, and traces—all within Elastic Observability. By following just a few simple steps, you’re empowered with a unified, efficient monitoring solution that brings your OpenTelemetry data directly into Elastic. With robust, out-of-the-box dashboards, automatic application instrumentation, and seamless integration, EDOT not only saves time but also enhances the accuracy and accessibility of observability across your infrastructure. Start leveraging EDOT today to unlock a frictionless observability experience and keep your systems running smoothly and insightfully.</p>
<p>Additional resources:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Elastic Distributions of OpenTelemetry Overview</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry">OpenTelemetry Demo with Elastic Distributions</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/infrastructure-monitoring-with-opentelemetry-in-elastic-observability">Infrastructure Monitoring with OpenTelemetry in Elastic Observability</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-go-applications-opentelemetry">Auto-Instrumenting Go Applications with OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution OpenTelemetry Java Agent</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-php">Elastic OpenTelemetry Distribution for PHP</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/OTel-operator.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[How to enable Kubernetes alerting with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/enable-kubernetes-alerting-observability</link>
            <guid isPermaLink="false">enable-kubernetes-alerting-observability</guid>
            <pubDate>Tue, 30 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In the Kubernetes world, different personas demand different kinds of insights. In this post, we’ll focus on alerting and provide an overview of how alerts in Elastic Observability can help users quickly identify Kubernetes problems.]]></description>
            <content:encoded><![CDATA[<p>In the Kubernetes world, different personas demand different kinds of insights. Developers are interested in granular metrics and debugging information. <a href="https://www.elastic.co/blog/elastic-observability-sre-incident-response">SREs</a> are interested in seeing everything at once to quickly get notified when a problem occurs and spot where the root cause is. In this post, we’ll focus on alerting and provide an overview of how alerts in Elastic Observability can help users quickly identify Kubernetes problems.</p>
<h2>Why do we need alerts?</h2>
<p>Logs, metrics, and traces are just the base to build a complete <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">monitoring solution for Kubernetes clusters</a>. Their main goal is to provide debugging information and historical evidence for the infrastructure.</p>
<p>While out-of-the-box dashboards, infrastructure topology, and logs exploration through Kibana are already quite handy to perform ad-hoc analyses, adding notifications and active monitoring of infrastructure allows users to deal with problems detected as early as possible and even proactively take actions to prevent their Kubernetes environments from facing even more serious issues.</p>
<h3>How can this be achieved?</h3>
<p>By building alerts on top of their infrastructure, users can leverage the data and effectively correlate it to a specific notification, creating a wide range of possibilities to dynamically monitor and observe their Kubernetes cluster.</p>
<p>In this blog post, we will explore how users can leverage Elasticsearch’s search powers to define alerting rules in order to be notified when a specific condition occurs.</p>
<h2>SLIs, alerts, and SLOs: Why are they important for SREs?</h2>
<p>For site reliability engineers (SREs), the <a href="https://www.elastic.co/blog/elastic-observability-sre-incident-response">incident response time</a> is tightly coupled with the success of everyday work. Monitoring, alerting, and actions will help to discover, resolve, or prevent issues in their systems.</p>
<blockquote>
<ul>
<li><em>An SLA (Service Level Agreement) is an agreement you create with your users to specify the level of service they can expect.</em></li>
<li><em>An SLO (Service Level Objective) is an agreement within an SLA about a specific metric like uptime or response time.</em></li>
<li><em>An SLI (Service Level Indicator) measures compliance with an SLO.</em></li>
</ul>
</blockquote>
<p>SREs’ day-to-day tasks and projects are driven by SLOs. By ensuring that SLOs are defended in the short term and that they can be maintained in the medium to long term, we lay the basis of a stable working infrastructure.</p>
<p>Having said this, identifying the high-level categories of SLOs is crucial in order to organize the work of an SRE. Then in each category of SLOs, SREs will need the corresponding SLIs that can cover the most important cases of their system under observation. Therefore, the decision of which SLIs we will need demands additional knowledge of the underlying system infrastructure.</p>
<p>One widely used approach to categorize SLIs and SLOs is the <a href="https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals">Four Golden Signals</a> method. The categories defined are Latency, Traffic, Errors, and Saturation.</p>
<p>A more specific approach is <a href="https://thenewstack.io/monitoring-microservices-red-method/">the RED method</a> developed by Tom Wilkie, who was an SRE at Google and used the Four Golden Signals. The RED method drops the saturation category because this one is mainly used for more advanced cases — and people remember better things that come in threes.</p>
<p>Focusing on Kubernetes infrastructure operators, we will consider the following groups of infrastructure SLIs/SLOs:</p>
<ul>
<li>Group 1: Latency of the control plane (apiserver)</li>
<li>Group 2: Resource utilization of the nodes/pods (how much cpu, memory, etc. is consumed)</li>
<li>Group 3: Errors (errors on logs or events or error count from components, network, etc.)</li>
</ul>
<h2>Creating alerts for a Kubernetes cluster</h2>
<p>Now that we have a complete outline of our goal to define alerts based on SLIs/SLOs, we will dive into defining the proper alerting. Alerts can be built using <a href="https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html">Kibana</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/enable-kubernetes-alerting-observability/blog-elastic-create-rule.png" alt="kubernetes create rule" /></p>
<p>See Elastic <a href="https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html">documentation</a>.</p>
<p>In this blog, we will define more complex alerts based on complex Elasticsearch queries provided by <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/watcher-getting-started.html">Watcher</a>’s functionality. <a href="https://www.elastic.co/guide/en/kibana/8.8/watcher-ui.html">Read more about Watcher</a> and how to properly use it in addition to the examples in this blog.</p>
<h3>Latency alerts</h3>
<p>For this kind of alert, we want to define the basic SLOs for a Kubernetes control plane, which will ensure that the basic control plane components can service the end users without an issue. For instance, facing high latencies in queries against the Kubernetes API Server is enough of a signal that action needs to be taken.</p>
<h3>Resource saturation</h3>
<p>The next group of alerts covers resource utilization. A node’s CPU utilization or a change in a node’s condition is critical for a cluster to ensure the smooth servicing of the workloads provisioned to run the applications that end users interact with.</p>
<h3>Error detection</h3>
<p>Last but not least, we will define alerts based on specific errors, like the network error rate, or Pod failures, like the OOMKilled situation. These are very useful indicators for SRE teams to either detect issues at the infrastructure level or notify developer teams about problematic workloads. One example that we will examine later is an application running as a Pod that constantly gets restarted because it hits its memory limit. In that case, the owners of this application will need to be notified so they can act properly.</p>
<h2>From Kubernetes data to Elasticsearch queries</h2>
<p>Having a solid plan about the alerts that we want to implement, it's time to explore the data we have collected from the Kubernetes cluster and stored in Elasticsearch. For this we will consult the list of the available data fields that are ingested using the Elastic Agent Kubernetes <a href="https://docs.elastic.co/en/integrations/kubernetes">integration</a> (the full list of fields can be found <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/exported-fields-kubernetes.html">here</a>). Using these fields we can create various alerts like:</p>
<ul>
<li>Node CPU utilization</li>
<li>Node Memory utilization</li>
<li>BW utilization</li>
<li>Pod restarts</li>
<li>Pod CPU/memory utilization</li>
</ul>
<h3>CPU utilization alert</h3>
<p>Our first example will use the CPU utilization fields to calculate the Node’s CPU utilization and create an alert. For this alert, we leverage the metrics:</p>
<pre><code class="language-yaml">kubernetes.node.cpu.usage.nanocores
kubernetes.node.cpu.capacity.cores
</code></pre>
<p>The following calculation, <code>(nodeUsage / 1000000000) / nodeCap</code>, grouped by node name, will give us the CPU utilization of each of our cluster’s nodes as a fraction between 0 and 1.</p>
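<p>As a quick sanity check of the arithmetic, here is the same formula applied to made-up numbers: a node consuming 2,500,000,000 nanocores (2.5 cores) out of a 4-core capacity.</p>
<pre><code class="language-bash"># Worked example of the bucket_script arithmetic with hypothetical values:
# usage of 2,500,000,000 nanocores on a node with 4 cores of capacity.
nodeUsage=2500000000
nodeCap=4
awk -v u=&quot;$nodeUsage&quot; -v c=&quot;$nodeCap&quot; 'BEGIN { printf &quot;%.3f\n&quot;, (u / 1000000000) / c }'
# prints 0.625, i.e. 62.5% node CPU utilization
</code></pre>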
<p>The Watcher definition that implements this query can be created with the following API call to Elasticsearch:</p>
<pre><code class="language-bash">curl -X PUT &quot;https://elastic:changeme@localhost:9200/_watcher/watch/Node-CPU-Usage?pretty&quot; -k -H 'Content-Type: application/json' -d'
{
  &quot;trigger&quot;: {
    &quot;schedule&quot;: {
      &quot;interval&quot;: &quot;10m&quot;
    }
  },
  &quot;input&quot;: {
    &quot;search&quot;: {
      &quot;request&quot;: {
        &quot;body&quot;: {
          &quot;size&quot;: 0,
          &quot;query&quot;: {
            &quot;bool&quot;: {
              &quot;must&quot;: [
                {
                  &quot;range&quot;: {
                    &quot;@timestamp&quot;: {
                      &quot;gte&quot;: &quot;now-10m&quot;,
                      &quot;lte&quot;: &quot;now&quot;,
                      &quot;format&quot;: &quot;strict_date_optional_time&quot;
                    }
                  }
                },
                {
                  &quot;bool&quot;: {
                    &quot;must&quot;: [
                      {
                        &quot;query_string&quot;: {
                          &quot;query&quot;: &quot;data_stream.dataset: kubernetes.node OR data_stream.dataset: kubernetes.state_node&quot;,
                          &quot;analyze_wildcard&quot;: true
                        }
                      }
                    ],
                    &quot;filter&quot;: [],
                    &quot;should&quot;: [],
                    &quot;must_not&quot;: []
                  }
                }
              ],
              &quot;filter&quot;: [],
              &quot;should&quot;: [],
              &quot;must_not&quot;: []
            }
          },
          &quot;aggs&quot;: {
            &quot;nodes&quot;: {
              &quot;terms&quot;: {
                &quot;field&quot;: &quot;kubernetes.node.name&quot;,
                &quot;size&quot;: &quot;10000&quot;,
                &quot;order&quot;: {
                  &quot;_key&quot;: &quot;asc&quot;
                }
              },
              &quot;aggs&quot;: {
                &quot;nodeUsage&quot;: {
                  &quot;max&quot;: {
                    &quot;field&quot;: &quot;kubernetes.node.cpu.usage.nanocores&quot;
                  }
                },
                &quot;nodeCap&quot;: {
                  &quot;max&quot;: {
                    &quot;field&quot;: &quot;kubernetes.node.cpu.capacity.cores&quot;
                  }
                },
                &quot;nodeCPUUsagePCT&quot;: {
                  &quot;bucket_script&quot;: {
                    &quot;buckets_path&quot;: {
                      &quot;nodeUsage&quot;: &quot;nodeUsage&quot;,
                      &quot;nodeCap&quot;: &quot;nodeCap&quot;
                    },
                    &quot;script&quot;: {
                      &quot;source&quot;: &quot;( params.nodeUsage / 1000000000 ) / params.nodeCap&quot;,
                      &quot;lang&quot;: &quot;painless&quot;,
                      &quot;params&quot;: {
                        &quot;_interval&quot;: 10000
                      }
                    },
                    &quot;gap_policy&quot;: &quot;skip&quot;
                  }
                }
              }
            }
          }
        },
        &quot;indices&quot;: [
          &quot;metrics-kubernetes*&quot;
        ]
      }
    }
  },
  &quot;condition&quot;: {
    &quot;array_compare&quot;: {
      &quot;ctx.payload.aggregations.nodes.buckets&quot;: {
        &quot;path&quot;: &quot;nodeCPUUsagePCT.value&quot;,
        &quot;gte&quot;: {
          &quot;value&quot;: 0.8
        }
      }
    }
  },
  &quot;actions&quot;: {
    &quot;log_hits&quot;: {
      &quot;foreach&quot;: &quot;ctx.payload.aggregations.nodes.buckets&quot;,
      &quot;max_iterations&quot;: 500,
      &quot;logging&quot;: {
        &quot;text&quot;: &quot;Kubernetes node found with high CPU usage: {{ctx.payload.key}} -&gt; {{ctx.payload.nodeCPUUsagePCT.value}}&quot;
      }
    }
  },
  &quot;metadata&quot;: {
    &quot;xpack&quot;: {
      &quot;type&quot;: &quot;json&quot;
    },
    &quot;name&quot;: &quot;Node CPU Usage&quot;
  }
}
</code></pre>
<h3>OOMKilled Pods detection and alerting</h3>
<p>Another Watcher that we will explore detects Pods that have been restarted due to an OOMKilled error. This error is quite common in Kubernetes workloads, and detecting it early is useful for informing the team that owns the workload, so they can either investigate issues that could cause memory leaks or consider increasing the resources requested for the workload itself.</p>
<p>This information can be retrieved from a query like the following:</p>
<pre><code class="language-yaml">kubernetes.container.status.last_terminated_reason: OOMKilled
</code></pre>
<p>Here is how we can create the respective Watcher with an API call:</p>
<pre><code class="language-bash">curl -X PUT &quot;https://elastic:changeme@localhost:9200/_watcher/watch/Pod-Terminated-OOMKilled?pretty&quot; -k -H 'Content-Type: application/json' -d'
{
  &quot;trigger&quot;: {
    &quot;schedule&quot;: {
      &quot;interval&quot;: &quot;1m&quot;
    }
  },
  &quot;input&quot;: {
    &quot;search&quot;: {
      &quot;request&quot;: {
        &quot;search_type&quot;: &quot;query_then_fetch&quot;,
        &quot;indices&quot;: [
          &quot;*&quot;
        ],
        &quot;rest_total_hits_as_int&quot;: true,
        &quot;body&quot;: {
          &quot;size&quot;: 0,
          &quot;query&quot;: {
            &quot;bool&quot;: {
              &quot;must&quot;: [
                {
                  &quot;range&quot;: {
                    &quot;@timestamp&quot;: {
                      &quot;gte&quot;: &quot;now-1m&quot;,
                      &quot;lte&quot;: &quot;now&quot;,
                      &quot;format&quot;: &quot;strict_date_optional_time&quot;
                    }
                  }
                },
                {
                  &quot;bool&quot;: {
                    &quot;must&quot;: [
                      {
                        &quot;query_string&quot;: {
                          &quot;query&quot;: &quot;data_stream.dataset: kubernetes.state_container&quot;,
                          &quot;analyze_wildcard&quot;: true
                        }
                      },
                      {
                        &quot;exists&quot;: {
                          &quot;field&quot;: &quot;kubernetes.container.status.last_terminated_reason&quot;
                        }
                      },
                      {
                        &quot;query_string&quot;: {
                          &quot;query&quot;: &quot;kubernetes.container.status.last_terminated_reason: OOMKilled&quot;,
                          &quot;analyze_wildcard&quot;: true
                        }
                      }
                    ],
                    &quot;filter&quot;: [],
                    &quot;should&quot;: [],
                    &quot;must_not&quot;: []
                  }
                }
              ],
              &quot;filter&quot;: [],
              &quot;should&quot;: [],
              &quot;must_not&quot;: []
            }
          },
          &quot;aggs&quot;: {
            &quot;pods&quot;: {
              &quot;terms&quot;: {
                &quot;field&quot;: &quot;kubernetes.pod.name&quot;,
                &quot;order&quot;: {
                  &quot;_key&quot;: &quot;asc&quot;
                }
              }
            }
          }
        }
      }
    }
  },
  &quot;condition&quot;: {
    &quot;array_compare&quot;: {
      &quot;ctx.payload.aggregations.pods.buckets&quot;: {
        &quot;path&quot;: &quot;doc_count&quot;,
        &quot;gte&quot;: {
          &quot;value&quot;: 1,
          &quot;quantifier&quot;: &quot;some&quot;
        }
      }
    }
  },
  &quot;actions&quot;: {
    &quot;ping_slack&quot;: {
      &quot;foreach&quot;: &quot;ctx.payload.aggregations.pods.buckets&quot;,
      &quot;max_iterations&quot;: 500,
      &quot;webhook&quot;: {
        &quot;method&quot;: &quot;POST&quot;,
        &quot;url&quot;: &quot;https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX&quot;,
        &quot;body&quot;: &quot;{\&quot;channel\&quot;: \&quot;#k8s-alerts\&quot;, \&quot;username\&quot;: \&quot;k8s-cluster-alerting\&quot;, \&quot;text\&quot;: \&quot;Pod {{ctx.payload.key}} was terminated with status OOMKilled.\&quot;}&quot;
      }
    }
  },
  &quot;metadata&quot;: {
    &quot;xpack&quot;: {
      &quot;type&quot;: &quot;json&quot;
    },
    &quot;name&quot;: &quot;Pod Terminated OOMKilled&quot;
  }
}
</code></pre>
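<p>The array_compare condition above fires when at least one term bucket satisfies the comparison (the “some” quantifier). Here is a minimal Java sketch of that semantics, with hypothetical per-pod doc_count values:</p>

```java
public class ArrayCompareSome {

    // Mirrors Watcher's array_compare with "quantifier": "some":
    // true if at least one value meets or exceeds the threshold.
    static boolean someGte(long[] values, long threshold) {
        for (long v : values) {
            if (v >= threshold) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long[] docCounts = { 0, 2, 0 }; // hypothetical per-pod bucket counts
        System.out.println(someGte(docCounts, 1)); // one pod was OOMKilled
    }
}
```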
<h3>From Kubernetes data to alerts summary</h3>
<p>So far, we have seen how to start from plain Kubernetes fields, use them in Elasticsearch queries, and build Watchers and alerts on top of them.</p>
<p>You can explore further data combinations and build queries and alerts following the examples provided here. A <a href="https://github.com/elastic/integrations/tree/main/packages/kubernetes/docs">full list of alerts</a> is available, as well as a <a href="https://github.com/elastic/k8s-integration-infra/tree/main/scripts/alerting">basic scripted way of installing them</a>.</p>
<p>These examples define simple actions that only log messages into the Elasticsearch logs. However, you can use more advanced and useful outputs, such as Slack webhooks:</p>
<pre><code class="language-json">&quot;actions&quot;: {
    &quot;ping_slack&quot;: {
      &quot;foreach&quot;: &quot;ctx.payload.aggregations.pods.buckets&quot;,
      &quot;max_iterations&quot;: 500,
      &quot;webhook&quot;: {
        &quot;method&quot;: &quot;POST&quot;,
        &quot;url&quot;: &quot;https://hooks.slack.com/services/T04SW3JHXasdfasdfasdfasdfasdf&quot;,
        &quot;body&quot;: &quot;{\&quot;channel\&quot;: \&quot;#k8s-alerts\&quot;, \&quot;username\&quot;: \&quot;k8s-cluster-alerting\&quot;, \&quot;text\&quot;: \&quot;Pod {{ctx.payload.key}} was terminated with status OOMKilled.\&quot;}&quot;
      }
    }
  }
</code></pre>
<p>The result would be a Slack message like the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/enable-kubernetes-alerting-observability/blog-elastic-k8s-cluster-alerting.png" alt="" /></p>
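<p>Watcher supports other action types besides logging and webhooks. As one more illustration, here is a sketch of an equivalent email action (the action name and recipient address are made up for this example, and it assumes an email account has been configured under xpack.notification.email in Elasticsearch):</p>

```json
"actions": {
  "email_ops": {
    "foreach": "ctx.payload.aggregations.pods.buckets",
    "max_iterations": 500,
    "email": {
      "to": "ops-team@example.com",
      "subject": "Kubernetes alert: Pod {{ctx.payload.key}} OOMKilled",
      "body": "Pod {{ctx.payload.key}} was terminated with status OOMKilled."
    }
  }
}
```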
<h2>Next steps</h2>
<p>In our next steps, we would like to make these alerts part of our Kubernetes integration, which would mean that the predefined alerts are installed when users install or enable the Kubernetes integration. At the same time, we plan to implement some of these as Kibana’s native SLIs, giving our users the option to quickly define SLOs on top of the SLIs through a friendly user interface. If you’re interested in learning more, follow the public GitHub issues and feel free to provide your feedback:</p>
<ul>
<li><a href="https://github.com/elastic/package-spec/issues/484">https://github.com/elastic/package-spec/issues/484</a></li>
<li><a href="https://github.com/elastic/kibana/issues/150050">https://github.com/elastic/kibana/issues/150050</a></li>
</ul>
<p>For those who are eager to start using Kubernetes alerting today, here is what you need to do:</p>
<ol>
<li>Make sure that you have an Elastic cluster up and running. The fastest way to deploy your cluster is to spin up a <a href="https://www.elastic.co/elasticsearch/service">free trial of Elasticsearch Service</a>.</li>
<li>Install the latest Elastic Agent on your Kubernetes cluster following the respective <a href="https://www.elastic.co/guide/en/fleet/master/running-on-kubernetes-managed-by-fleet.html">documentation</a>.</li>
<li>Install our provided alerts that can be found at <a href="https://github.com/elastic/integrations/tree/main/packages/kubernetes/docs">https://github.com/elastic/integrations/tree/main/packages/kubernetes/docs</a> or at <a href="https://github.com/elastic/k8s-integration-infra/tree/main/scripts/alerting">https://github.com/elastic/k8s-integration-infra/tree/main/scripts/alerting</a>.</li>
</ol>
<p>Of course, if you have any questions, remember that we are always happy to help on the Discuss <a href="https://discuss.elastic.co/">forums</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/enable-kubernetes-alerting-observability/alert-management.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[How to easily add application monitoring in Kubernetes pods]]></title>
            <link>https://www.elastic.co/observability-labs/blog/application-monitoring-kubernetes-pods</link>
            <guid isPermaLink="false">application-monitoring-kubernetes-pods</guid>
            <pubDate>Wed, 17 Jan 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog walks through installing the Elastic APM K8s Attacher and shows how to configure your system for both common and non-standard deployments of Elastic APM agents.]]></description>
            <content:encoded><![CDATA[<p>The <a href="https://www.elastic.co/guide/en/apm/attacher/current/index.html">Elastic® APM K8s Attacher</a> allows auto-installation of Elastic APM application agents (e.g., the Elastic APM Java agent) into applications running in your Kubernetes clusters. The mechanism uses a <a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/">mutating webhook</a>, which is a standard Kubernetes component, but you don’t need to know all the details to use the Attacher. Essentially, you can install the Attacher, add one annotation to any Kubernetes deployment that has an application you want monitored, and that’s it!</p>
<p>In this blog, we’ll walk through a full example from scratch using a Java application. Apart from the Java code and using a JVM for the application, everything else works the same for the other languages supported by the Attacher.</p>
<h2>Prerequisites</h2>
<p>This walkthrough assumes that the following are already installed on the system: JDK 17, Docker, Kubernetes, and Helm.</p>
<h2>The example application</h2>
<p>While the application (shown below) is a Java application, it could easily be implemented in any language: it is just a simple loop that every 2 seconds calls the method chain methodA-&gt;methodB-&gt;methodC-&gt;methodD, with methodC sleeping for 10 milliseconds and methodD sleeping for 200 milliseconds. This application was chosen so that the Elastic APM UI can clearly show that it is being monitored.</p>
<p>The Java application in full is shown here:</p>
<pre><code class="language-java">package test;

public class Testing implements Runnable {

  public static void main(String[] args) {
    new Thread(new Testing()).start();
  }

  public void run()
  {
    while(true) {
      try {Thread.sleep(2000);} catch (InterruptedException e) {}
      methodA();
    }
  }

  public void methodA() {methodB();}

  public void methodB() {methodC();}

  public void methodC() {
    System.out.println(&quot;methodC executed&quot;);
    try {Thread.sleep(10);} catch (InterruptedException e) {}
    methodD();
  }

  public void methodD() {
    System.out.println(&quot;methodD executed&quot;);
    try {Thread.sleep(200);} catch (InterruptedException e) {}
  }
}
</code></pre>
<p>We have created a Docker image containing that simple Java application; it can be pulled from the following Docker repository:</p>
<pre><code class="language-bash">docker.elastic.co/demos/apm/k8s-webhook-test
</code></pre>
<h2>Deploy the pod</h2>
<p>First we need a deployment config. We’ll call the config file webhook-test.yaml, and the contents are pretty minimal — just pull the image and run that as a pod &amp; container called webhook-test in the default namespace:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: webhook-test
  labels:
    app: webhook-test
spec:
  containers:
    - image: docker.elastic.co/demos/apm/k8s-webhook-test
      imagePullPolicy: Always
      name: webhook-test
</code></pre>
<p>This can be deployed normally using kubectl:</p>
<pre><code class="language-bash">kubectl apply -f webhook-test.yaml
</code></pre>
<p>The result is exactly as expected:</p>
<pre><code class="language-bash">$ kubectl get pods
NAME           READY   STATUS    RESTARTS   AGE
webhook-test   1/1     Running   0          10s

$ kubectl logs webhook-test
methodC executed
methodD executed
methodC executed
methodD executed
</code></pre>
<p>So far, this is just setting up a standard Kubernetes application with no APM monitoring. Now we get to the interesting bit: adding in auto-instrumentation.</p>
<h2>Install Elastic APM K8s Attacher</h2>
<p>The first step is to install the <a href="https://www.elastic.co/guide/en/apm/attacher/current/index.html">Elastic APM K8s Attacher</a>. This only needs to be done once for the cluster — once installed, it is always available. Before installation, we will define where the monitored data will go. As you will see later, we can decide or change this any time. For now, we’ll specify our own Elastic APM server, which is at <a href="https://myserver.somecloud:443">https://myserver.somecloud:443</a> — we also have a secret token for authorization to that Elastic APM server, which has value MY_SECRET_TOKEN. (If you want to set up a quick test Elastic APM server, you can do so at <a href="https://cloud.elastic.co/">https://cloud.elastic.co/</a>).</p>
<p>There are two additional environment variables set for the application that are not generally needed but will help when we see the resulting UI content toward the end of the walkthrough (when the agent is auto-installed, these two variables tell the agent what name to give this application in the UI and what method to trace). Now we just need to define the custom yaml file to hold these. On installation, the custom yaml will be merged into the yaml for the Attacher:</p>
<pre><code class="language-yaml">apm:
  secret_token: MY_SECRET_TOKEN
  namespaces:
    - default
webhookConfig:
  agents:
    java:
      environment:
        ELASTIC_APM_SERVER_URL: &quot;https://myserver.somecloud:443&quot;
        ELASTIC_APM_TRACE_METHODS: &quot;test.Testing#methodB&quot;
        ELASTIC_APM_SERVICE_NAME: &quot;webhook-test&quot;
</code></pre>
<p>That custom.yaml file is all we need to install the attacher (note we’ve only specified the default namespace for agent auto-installation for now — this can be easily changed, as you’ll see later). Next we’ll add the Elastic charts to Helm — this only needs to be done once; after that, all Elastic charts are available to Helm. This is the usual helm repo add command, specifically:</p>
<pre><code class="language-bash">helm repo add elastic https://helm.elastic.co
</code></pre>
<p>Now the Elastic charts are available for installation (helm search repo would show you all the available charts). We’re going to use “elastic-webhook” as the release name, resulting in the following installation command:</p>
<pre><code class="language-bash">helm install elastic-webhook elastic/apm-attacher --namespace=elastic-apm --create-namespace --values custom.yaml
</code></pre>
<p>And that’s it: we now have the Elastic APM K8s Attacher installed and set to send data to the APM server defined in the custom.yaml file! (You can confirm the installation with helm list -A if you need to.)</p>
<h2>Auto-install the Java agent</h2>
<p>The Elastic APM K8s Attacher is installed, but it doesn’t auto-install the APM application agents into every pod — that could lead to problems! Instead, the Attacher is deliberately limited to auto-installing agents into deployments that (a) are in one of the namespaces listed in the custom.yaml and (b) carry the specific annotation “co.elastic.apm/attach.”</p>
<p>So for now, restarting the webhook-test pod we created above won’t change its behavior, as it isn’t yet set to be monitored. What we need to do is add the annotation. Specifically, we add the annotation referencing the default agent configuration that was installed with the Attacher, called “java” for the Java agent (we’ll see later how an agent configuration is altered — the default configuration installs the latest agent version and leaves everything else at its defaults for that version). Adding that annotation to the webhook-test yaml gives us the new yaml file contents (the additional config is labelled (1)):</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: webhook-test
  annotations: #(1)
    co.elastic.apm/attach: java #(1)
  labels:
    app: webhook-test
spec:
  containers:
    - image: docker.elastic.co/demos/apm/k8s-webhook-test
      imagePullPolicy: Always
      name: webhook-test
</code></pre>
<p>Applying this change gives us the application now monitored:</p>
<pre><code class="language-bash">$ kubectl delete -f webhook-test.yaml
pod &quot;webhook-test&quot; deleted
$ kubectl apply -f webhook-test.yaml
pod/webhook-test created
$ kubectl logs webhook-test
… StartupInfo - Starting Elastic APM 1.45.0 …
</code></pre>
<p>And since the agent is now feeding data to our APM server, we can now see it in the UI:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/application-monitoring-kubernetes-pods/webhook-test-k8s-blog.png" alt="webhook-test" /></p>
<p>Note that the agent identifies the Testing.methodB method as a trace root because of the ELASTIC_APM_TRACE_METHODS environment variable set to test.Testing#methodB in the custom.yaml — this tells the agent to specifically trace that method. The time taken by that method is available in the UI for each invocation, but we don’t yet see the sub-methods. In the next section, we’ll see how easy it is to customize the Attacher, and in doing so we’ll see more detail about the method chain being executed in the application.</p>
<h2>Customizing the agents</h2>
<p>In your systems, you’ll likely have development, testing, and production environments. You’ll want to specify the version of the agent to use rather than just pulling whatever the latest version happens to be, you’ll want debug logging on for some applications or instances, and you’ll want specific options set to specific values. This sounds like a lot of effort, but the attacher lets you make these kinds of changes in a very simple way. In this section, we’ll add a configuration that specifies all these changes, so we can see just how easy it is to configure and enable.</p>
<p>We start with the custom.yaml file we defined above. This is the file that gets merged into the Attacher’s configuration. Adding a new configuration with all the items listed in the last paragraph is easy, though first we need to decide on a name for it. We’ll call it “java-interesting” here. The new custom.yaml in full is (the first part is the same as before; the new config is simply appended):</p>
<pre><code class="language-yaml">apm:
  secret_token: MY_SECRET_TOKEN
  namespaces:
    - default
webhookConfig:
  agents:
    java:
      environment:
        ELASTIC_APM_SERVER_URL: &quot;https://myserver.somecloud:443&quot;
        ELASTIC_APM_TRACE_METHODS: &quot;test.Testing#methodB&quot;
        ELASTIC_APM_SERVICE_NAME: &quot;webhook-test&quot;
    java-interesting:
      image: docker.elastic.co/observability/apm-agent-java:1.55.4
      artifact: &quot;/usr/agent/elastic-apm-agent.jar&quot;
      environment:
        ELASTIC_APM_SERVER_URL: &quot;https://myserver.somecloud:443&quot;
        ELASTIC_APM_TRACE_METHODS: &quot;test.Testing#methodB&quot;
        ELASTIC_APM_SERVICE_NAME: &quot;webhook-test&quot;
        ELASTIC_APM_ENVIRONMENT: &quot;testing&quot;
        ELASTIC_APM_LOG_LEVEL: &quot;debug&quot;
        ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED: &quot;true&quot;
        JAVA_TOOL_OPTIONS: &quot;-javaagent:/elastic/apm/agent/elastic-apm-agent.jar&quot;
</code></pre>
<p>Breaking the additional config down, we have:</p>
<ul>
<li>
<p>The name of the new config java-interesting</p>
</li>
<li>
<p>The APM Java agent image docker.elastic.co/observability/apm-agent-java</p>
<ul>
<li>With a specific version, 1.55.4, instead of latest</li>
</ul>
</li>
<li>
<p>We need to specify the agent jar location within the agent image</p>
<ul>
<li>artifact: &quot;/usr/agent/elastic-apm-agent.jar&quot;</li>
</ul>
</li>
<li>
<p>And then the environment variables:</p>
<ul>
<li>ELASTIC_APM_SERVER_URL, ELASTIC_APM_TRACE_METHODS, and ELASTIC_APM_SERVICE_NAME as before</li>
<li>ELASTIC_APM_ENVIRONMENT set to testing, useful when looking in the UI</li>
<li>ELASTIC_APM_LOG_LEVEL set to debug for more detailed agent output</li>
<li>ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED set to true, which will give us additional interesting information about the method chain being executed in the application</li>
<li>JAVA_TOOL_OPTIONS set to &quot;-javaagent:/elastic/apm/agent/elastic-apm-agent.jar&quot; to enable starting the agent — this is fundamentally how the attacher auto-attaches the Java agent</li>
</ul>
</li>
</ul>
<p>More configurations and details about configuration options are <a href="https://www.elastic.co/guide/en/apm/agent/java/current/configuration.html">here for the Java agent</a>, and <a href="https://www.elastic.co/guide/en/apm/agent/index.html">other language agents</a> are also available.</p>
<h2>The application traced with the new configuration</h2>
<p>And finally we just need to upgrade the attacher with the changed custom.yaml:</p>
<pre><code class="language-bash">helm upgrade elastic-webhook elastic/apm-attacher --namespace=elastic-apm --create-namespace --values custom.yaml
</code></pre>
<p>This is the same command as the original install, but now using upgrade. That’s it — add config to the custom.yaml and upgrade the attacher, and it’s done! Simple.</p>
<p>Of course we still need to use the new config on an app. In this case, we’ll edit the existing webhook-test.yaml file, replacing java with java-interesting, so the annotation line is now:</p>
<pre><code class="language-yaml">co.elastic.apm/attach: java-interesting
</code></pre>
<p>Applying the new pod config and restarting the pod, you can see the logs now hold debug output:</p>
<pre><code class="language-bash">$ kubectl delete -f webhook-test.yaml
pod &quot;webhook-test&quot; deleted
$ kubectl apply -f webhook-test.yaml
pod/webhook-test created
$ kubectl logs webhook-test
… StartupInfo - Starting Elastic APM 1.55.4 …
… DEBUG co.elastic.apm.agent. …
… DEBUG co.elastic.apm.agent. …
</code></pre>
<p>More interesting is the UI. Now that inferred spans is on, the full method chain is visible.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/application-monitoring-kubernetes-pods/trace-sample-k8s-blog.png" alt="trace sample" /></p>
<p>This gives the details for methodB (it takes 211 milliseconds because it calls methodC, which sleeps 10 ms, which in turn calls methodD, which sleeps 200 ms). The times for methodC and methodD are inferred rather than traced; if you needed accurate times, you would instead add those methods to trace_methods and have them traced too.</p>
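<p>The timings above can be reproduced outside Kubernetes and the APM UI. The following is a minimal, self-contained Java sketch of the same method chain that measures methodB directly (the class name is ours; the sleeps match the example application):</p>

```java
public class MethodChainTiming {

    static void pause(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { }
    }

    static void methodD() { pause(200); }
    static void methodC() { pause(10); methodD(); }
    static void methodB() { methodC(); }

    // Wall-clock duration of one methodB call, in milliseconds.
    static long timeMethodB() {
        long start = System.nanoTime();
        methodB();
        return (System.nanoTime() - start) / 1000000;
    }

    public static void main(String[] args) {
        // Roughly 210 ms plus scheduling overhead, matching the span in the UI.
        System.out.println("methodB took about " + timeMethodB() + " ms");
    }
}
```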
<h2>Note on the ECK operator</h2>
<p>The <a href="https://www.elastic.co/guide/en/cloud-on-k8s/master/k8s-overview.html">Elastic Cloud on Kubernetes operator</a> allows you to install and manage a number of other Elastic components on Kubernetes. At the time of publication of this blog, the <a href="https://www.elastic.co/guide/en/apm/attacher/current/index.html">Elastic APM K8s Attacher</a> is a separate component, and there is no conflict between these management mechanisms — they apply to different components and are independent of each other.</p>
<h2>Try it yourself!</h2>
<p>This walkthrough is easily repeated on your system, and you can make it more useful by replacing the example application with your own and the Docker registry with the one you use.</p>
<p><a href="https://www.elastic.co/observability/kubernetes-monitoring">Learn more about real-time monitoring with Kubernetes and Elastic Observability</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/application-monitoring-kubernetes-pods/139689_-_Blog_Header_Banner_V1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[How to monitor Kafka and Confluent Cloud with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitor-kafka-confluent-cloud-elastic-observability</link>
            <guid isPermaLink="false">monitor-kafka-confluent-cloud-elastic-observability</guid>
            <pubDate>Mon, 03 Apr 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog post will take you through best practices to observe Kafka-based solutions implemented on Confluent Cloud with Elastic Observability.]]></description>
            <content:encoded><![CDATA[<p>The blog will take you through best practices to observe Kafka-based solutions implemented on Confluent Cloud with Elastic Observability. (To monitor Kafka brokers that are not in Confluent Cloud, I recommend checking out <a href="https://www.elastic.co/blog/how-to-monitor-containerized-kafka-with-elastic-observability">this blog</a>.) We will instrument Kafka applications with <a href="https://www.elastic.co/observability/application-performance-monitoring">Elastic APM</a>, use the Confluent Cloud metrics endpoint to get data about brokers, and pull it all together with a unified Kafka and Confluent Cloud monitoring dashboard in <a href="https://www.elastic.co/observability">Elastic Observability</a>.</p>
<h2>Using full-stack Elastic Observability to understand Kafka and Confluent performance</h2>
<p>In the <a href="https://dice.viewer.foleon.com/ebooks/dice-tech-salary-report-explore/">2023 Dice Tech Salary Report</a>, Elasticsearch and Kafka are ranked #3 and #5 out of the top 12 <a href="https://dice.viewer.foleon.com/ebooks/dice-tech-salary-report-explore/salary-trends#Skills">most in-demand skills</a> at the moment, so it’s no surprise that we are seeing a large number of customers implementing data in motion with Kafka.</p>
<p><a href="https://www.elastic.co/integrations/data-integrations?search=kafka">Kafka</a> comes with some additional complexities that go beyond traditional architectures and which make observability an even more important topic. Understanding where the bottlenecks are in messaging and stream-based architectures can be tough. This is why you need a comprehensive observability solution with <a href="https://www.elastic.co/blog/aiops-use-cases-observability-operations">machine learning</a> to help you.</p>
<p>In this blog, we will explore how to get Kafka applications instrumented with <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">Elastic APM</a>, how to collect performance data with JMX, and how you can use the Elasticsearch Platform to pull in data from Confluent Cloud — which is by far the easiest and most cost-effective way to implement Kafka architectures.</p>
<p>For this blog post, we will be following the code in this <a href="https://github.com/davidgeorgehope/multi-cloud">git repository</a>. There are three services here that are designed to run on two clouds, push data from one cloud to the other, and finally land it in Google BigQuery. We want to monitor all of this using Elastic Observability, giving you a complete picture of Confluent and Kafka services performance. As a teaser, this is the goal:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-producer_metrics.png" alt="kafka producer metrics" /></p>
<h2>A look at the architecture</h2>
<p>As mentioned, we have three <a href="https://www.elastic.co/observability/cloud-monitoring">multi-cloud services</a> implemented in our example application.</p>
<p>The first service is a Spring WebFlux service that runs inside AWS EKS. This service will take a message from a REST Endpoint and simply put it straight on to a Kafka topic.</p>
<p>The second service, which is also a Spring WebFlux service hosted inside Google Cloud Platform (GCP) with its <a href="https://www.elastic.co/observability/google-cloud-monitoring">Google Cloud monitoring</a>, will then pick this up and forward it to another service that will put the message into BigQuery.</p>
<p>These services are all instrumented using Elastic APM. For this blog, we have decided to use Spring config to inject and configure the APM agent. You could of course use the “-javaagent” argument to inject the agent instead if preferred.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-obsevability-aws-kafka-google-cloud.png" alt="aws kafka google cloud" /></p>
<h2>Getting started with Elastic Observability and Confluent Cloud</h2>
<p>Before we dive into the application and its configuration, you will want to get an Elastic Cloud and Confluent Cloud account. You can sign up here for <a href="https://www.elastic.co/cloud/">Elastic</a> and here for <a href="https://www.confluent.io/confluent-cloud/">Confluent Cloud</a>. There are some initial configuration steps we need to do inside Confluent Cloud, as you will need to create three topics: gcpTopic, myTopic, and topic_2.</p>
<p>When you sign up for Confluent Cloud, you will be given an option of what type of cluster to create. For this walk-through, a Basic cluster is fine (as shown) — if you are careful about usage, it will not cost you a penny.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-confluent-create-cluster.png" alt="confluent create cluster" /></p>
<p>Once you have a cluster, go ahead and create the three topics.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-confluent-topics.png" alt="confluent topics" /></p>
<p>For this walk-through, you will only need to create single partition topics as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-new-topic.png" alt="new topic" /></p>
<p>Now we are ready to set up the Elastic Cloud cluster.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-create-a-deployment.png" alt="create a deployment" /></p>
<p>One thing to note here is that when setting up an Elastic cluster, the defaults are mostly OK, with one minor tweak: under “Advanced Settings,” add capacity for machine learning.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-machine-learning-instances.png" alt="machine learning instances" /></p>
<h2>Getting APM up and running</h2>
<p>The first thing we want to do here is get our Spring Boot Webflux-based services up and running. For this blog, I have decided to implement this using the Spring Configuration, as you can see below. For brevity, I have not listed all the JMX configuration information, but you can see those details in <a href="https://github.com/davidgeorgehope/multi-cloud/blob/main/aws-multi-cloud/src/main/java/com/elastic/multicloud/ElasticApmConfig.java">GitHub</a>.</p>
<pre><code class="language-java">package com.elastic.multicloud;
import co.elastic.apm.attach.ElasticApmAttacher;
import jakarta.annotation.PostConstruct;
import lombok.Setter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

import java.util.HashMap;
import java.util.Map;

@Setter
@Configuration
@ConfigurationProperties(prefix = &quot;elastic.apm&quot;)
@ConditionalOnProperty(value = &quot;elastic.apm.enabled&quot;, havingValue = &quot;true&quot;)
public class ElasticApmConfig {

    private static final String SERVER_URL_KEY = &quot;server_url&quot;;
    private String serverUrl;

    private static final String SERVICE_NAME_KEY = &quot;service_name&quot;;
    private String serviceName;

    private static final String SECRET_TOKEN_KEY = &quot;secret_token&quot;;
    private String secretToken;

    private static final String ENVIRONMENT_KEY = &quot;environment&quot;;
    private String environment;

    private static final String APPLICATION_PACKAGES_KEY = &quot;application_packages&quot;;
    private String applicationPackages;

    private static final String LOG_LEVEL_KEY = &quot;log_level&quot;;
    private String logLevel;
    private static final Logger LOGGER = LoggerFactory.getLogger(ElasticApmConfig.class);

    @PostConstruct
    public void init() {
        LOGGER.info(environment);

        Map&lt;String, String&gt; apmProps = new HashMap&lt;&gt;(6);
        apmProps.put(SERVER_URL_KEY, serverUrl);
        apmProps.put(SERVICE_NAME_KEY, serviceName);
        apmProps.put(SECRET_TOKEN_KEY, secretToken);
        apmProps.put(ENVIRONMENT_KEY, environment);
        apmProps.put(APPLICATION_PACKAGES_KEY, applicationPackages);
        apmProps.put(LOG_LEVEL_KEY, logLevel);
        apmProps.put(&quot;enable_experimental_instrumentations&quot;, &quot;true&quot;);
        apmProps.put(&quot;capture_jmx_metrics&quot;, &quot;object_name[kafka.producer:type=producer-metrics,client-id=*] attribute[batch-size-avg:metric_name=kafka.producer.batch-size-avg]&quot;);

        ElasticApmAttacher.attach(apmProps);
    }
}
</code></pre>
<p>Now obviously this requires some dependencies, which you can see here in the Maven pom.xml.</p>
<pre><code class="language-xml">&lt;dependency&gt;
    &lt;groupId&gt;co.elastic.apm&lt;/groupId&gt;
    &lt;artifactId&gt;apm-agent-attach&lt;/artifactId&gt;
    &lt;version&gt;1.35.1-SNAPSHOT&lt;/version&gt;
&lt;/dependency&gt;
&lt;dependency&gt;
    &lt;groupId&gt;co.elastic.apm&lt;/groupId&gt;
    &lt;artifactId&gt;apm-agent-api&lt;/artifactId&gt;
    &lt;version&gt;1.35.1-SNAPSHOT&lt;/version&gt;
&lt;/dependency&gt;
</code></pre>
<p>Strictly speaking, the agent-api dependency is not required, but it is useful if you want to add your own monitoring code (as in the example below). The agent will happily auto-instrument without it, though.</p>
<pre><code class="language-java">Span span = ElasticApm.currentSpan()
        .startSpan(&quot;external&quot;, &quot;kafka&quot;, null)
        .setName(&quot;DAVID&quot;)
        .setServiceTarget(&quot;kafka&quot;, &quot;gcp-elastic-apm-spring-boot-integration&quot;);
try (final Scope scope = span.activate()) {
    // Propagate the trace context into the Kafka record headers so the
    // consumer side can continue the distributed trace.
    span.injectTraceHeaders((name, value) -&gt; producerRecord.headers().add(name, value.getBytes()));
    return Mono.fromRunnable(() -&gt; kafkaTemplate.send(producerRecord));
} catch (Exception e) {
    span.captureException(e);
    throw e;
} finally {
    span.end();
}
</code></pre>
<p>Now we have enough code to get our agent bootstrapped.</p>
<p>To get the code from the GitHub repository up and running, you will need the following installed on your system, along with credentials for your GCP and AWS accounts.</p>
<ul>
<li>Java</li>
<li>Maven</li>
<li>Docker</li>
<li>Kubernetes CLI (kubectl)</li>
</ul>
<h3>Clone the project</h3>
<p>Clone the multi-cloud Spring project to your local machine.</p>
<pre><code class="language-bash">git clone https://github.com/davidgeorgehope/multi-cloud
</code></pre>
<h3>Build the project</h3>
<p>From each service in the project (aws-multi-cloud, gcp-multi-cloud, gcp-bigdata-consumer-multi-cloud), run the following commands to build the project.</p>
<pre><code class="language-bash">mvn clean install
</code></pre>
<p>Now you can run the Java project locally.</p>
<pre><code class="language-bash">java -jar gcp-bigdata-consumer-multi-cloud-0.0.1-SNAPSHOT.jar --spring.config.location=/Users/davidhope/application-gcp.properties
</code></pre>
<p>That will just get the Java application running locally, but you can also deploy this to Kubernetes using EKS and GKE as shown below.</p>
<h3>Create a Docker image</h3>
<p>Create a Docker image from the built project using the dockerBuild.sh script provided in the project. You may want to customize this shell script to push the built Docker image to your own Docker registry.</p>
<pre><code class="language-bash">./dockerBuild.sh
</code></pre>
<h3>Create a namespace for each service</h3>
<pre><code class="language-bash">kubectl create namespace aws
</code></pre>
<pre><code class="language-bash">kubectl create namespace gcp-1
</code></pre>
<pre><code class="language-bash">kubectl create namespace gcp-2
</code></pre>
<p>Once you have the namespaces created, you can switch the active namespace with the following command (replacing my-namespace with aws, gcp-1, or gcp-2):</p>
<pre><code class="language-bash">kubectl config set-context --current --namespace=my-namespace
</code></pre>
<h3>Configuration for each service</h3>
<p>Each service needs an application.properties file. I have put an example <a href="https://github.com/davidgeorgehope/multi-cloud/blob/main/gcp-bigdata-consumer-multi-cloud/application.properties">here</a>.</p>
<p>You will need to replace the following properties with those you find in Elastic.</p>
<pre><code class="language-bash">elastic.apm.server-url=
elastic.apm.secret-token=
</code></pre>
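<p>It is worth noting how these Spring-style keys reach the agent: Spring's relaxed binding maps kebab-case properties such as <code>elastic.apm.server-url</code> onto the camelCase fields of the <code>ElasticApmConfig</code> class shown earlier, which then hands them to the attacher under snake_case option names. Here is a minimal sketch of that key translation (illustrative only, not part of the project):</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: shows how a kebab-case property such as
// "elastic.apm.server-url" corresponds to the snake_case key
// "server_url" that the agent attacher expects.
public class ApmKeyMapping {
    // Convert the suffix after the "elastic.apm." prefix
    // from kebab-case to the agent's snake_case option name.
    static String toAgentKey(String springKey) {
        String suffix = springKey.substring("elastic.apm.".length());
        return suffix.replace('-', '_');
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("elastic.apm.server-url", "https://example.apm.endpoint"); // placeholder value
        props.put("elastic.apm.secret-token", "REDACTED");                   // placeholder value

        Map<String, String> agentProps = new LinkedHashMap<>();
        props.forEach((k, v) -> agentProps.put(toAgentKey(k), v));

        System.out.println(agentProps.containsKey("server_url"));   // true
        System.out.println(agentProps.containsKey("secret_token")); // true
    }
}
```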
<p>These can be found by going into Elastic Cloud and clicking on <strong>Services</strong> inside APM and then <strong>Add Data</strong>, which should be visible in the top right corner.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-add-data.png" alt="add data" /></p>
<p>From there you will see the following, which gives you the config information you need.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-apm-agents.png" alt="apm agents" /></p>
<p>You will need to replace the following properties with those you find in Confluent Cloud.</p>
<pre><code class="language-bash">elastic.kafka.producer.sasl-jaas-config=
</code></pre>
<p>This configuration comes from the Clients page in Confluent Cloud.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-confluent-new-client.png" alt="confluent new client" /></p>
<h3>Adding the config for each service in Kubernetes</h3>
<p>Once you have a fully configured application properties, you need to add it to your <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">Kubernetes environment</a> as below.</p>
<p>From the aws namespace.</p>
<pre><code class="language-bash">kubectl create secret generic my-app-config --from-file=application.properties
</code></pre>
<p>From the gcp-1 namespace.</p>
<pre><code class="language-bash">kubectl create secret generic my-app-config --from-file=application.properties
</code></pre>
<p>From the gcp-2 namespace.</p>
<pre><code class="language-bash">kubectl create secret generic bigdata-creds --from-file=elastic-product-marketing-e145e13fbc7c.json

kubectl create secret generic my-app-config-gcp-bigdata --from-file=application.properties
</code></pre>
<h3>Create a Kubernetes deployment</h3>
<p>Create a Kubernetes deployment YAML file and add your Docker image to it. You can use the deployment.yaml file provided in the project as a template. Make sure to update the image name in the file to match the name of the Docker image you just created.</p>
<pre><code class="language-bash">kubectl apply -f deployment.yaml
</code></pre>
<h3>Create a Kubernetes service</h3>
<p>Create a Kubernetes service YAML file and add your deployment to it. You can use the service.yaml file provided in the project as a template.</p>
<pre><code class="language-bash">kubectl apply -f service.yaml
</code></pre>
<h3>Access your application</h3>
<p>Your application is now running in a Kubernetes cluster. To access it, use the service's cluster IP and port, which you can find with the following command.</p>
<pre><code class="language-bash">kubectl get services
</code></pre>
<p>Once you know where the service is exposed, you need to send it some traffic!</p>
<p>You can regularly poke the service endpoint using the following command.</p>
<pre><code class="language-bash">curl -X POST -H &quot;Content-Type: application/json&quot; -d '{&quot;name&quot;: &quot;linuxize&quot;, &quot;email&quot;: &quot;linuxize@example.com&quot;}' http://localhost:8080/api/my-objects/publish
</code></pre>
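<p>If you would rather drive the endpoint from Java, for example inside a small load-generation loop, the same request can be sketched with the JDK's built-in HTTP client. The base URL below is an assumption; substitute the IP and port that <code>kubectl get services</code> reported.</p>

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Sketch of the curl command above using the JDK's built-in HTTP client.
// The base URL is a placeholder; substitute your service's IP and port.
public class Poke {
    static HttpRequest buildRequest(String baseUrl) {
        String json = "{\"name\": \"linuxize\", \"email\": \"linuxize@example.com\"}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/my-objects/publish"))
                .header("Content-Type", "application/json")
                .timeout(Duration.ofSeconds(5))
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildRequest("http://localhost:8080");
        System.out.println(req.method() + " " + req.uri());
        // To actually send it:
        // HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString());
    }
}
```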
<p>With this up and running, you should see the following service map build out in the Elastic APM product.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-aws-elastic-apm-spring-boot.png" alt="aws elastic apm spring boot" /></p>
<p>And traces will contain a waterfall graph showing all the spans that have executed across this distributed application, allowing you to pinpoint where any issues are within each transaction.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-services.png" alt="observability services" /></p>
<h2>JMX for Kafka Producer/Consumer metrics</h2>
<p>In the previous part of this blog, we briefly touched on the JMX metric configuration you can see below.</p>
<pre><code class="language-java">&quot;capture_jmx_metrics&quot;, &quot;object_name[kafka.producer:type=producer-metrics,client-id=*] attribute[batch-size-avg:metric_name=kafka.producer.batch-size-avg]&quot;
</code></pre>
<p>We can use this “capture_jmx_metrics” setting to capture any of the Kafka producer/consumer metrics we want to monitor.</p>
<p>Check out the documentation <a href="https://www.elastic.co/guide/en/apm/agent/java/current/config-jmx.html">here</a> to understand how to configure this and <a href="https://docs.confluent.io/platform/current/kafka/monitoring.html">here</a> to see the available JMX metrics you can monitor. In the <a href="https://github.com/davidgeorgehope/multi-cloud/blob/main/gcp-bigdata-consumer-multi-cloud/src/main/java/com/elastic/multicloud/ElasticApmConfig.java">example code in GitHub</a>, we actually pull all the available metrics in, so you can check in there how to configure this.</p>
<p>One thing worth pointing out: it’s important to set the “metric_name” property shown above, as without it the metrics become quite difficult to find in Elastic Discover.</p>
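<p>If you want to capture several attributes from the same MBean, each with a friendly <code>metric_name</code>, it can help to assemble the setting programmatically. Below is a sketch; <code>record-send-rate</code> is one of Kafka's documented producer metrics, and the exact capture_jmx_metrics grammar is described in the agent documentation linked above.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch: build a capture_jmx_metrics value that maps each
// Kafka producer MBean attribute to an explicit metric_name, so the
// resulting metrics are easy to locate in Discover.
public class JmxMetricConfig {
    static String captureJmxMetrics(Map<String, String> attributeToMetricName) {
        String attributes = attributeToMetricName.entrySet().stream()
                .map(e -> String.format("attribute[%s:metric_name=%s]", e.getKey(), e.getValue()))
                .collect(Collectors.joining(" "));
        return "object_name[kafka.producer:type=producer-metrics,client-id=*] " + attributes;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("batch-size-avg", "kafka.producer.batch-size-avg");
        attrs.put("record-send-rate", "kafka.producer.record-send-rate");
        System.out.println(captureJmxMetrics(attrs));
    }
}
```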
<h2>Monitoring Confluent Cloud with Elastic Observability</h2>
<p>So we now have some good monitoring set up for Kafka Producers and Consumers and we can trace transactions between services down to the lines of code that are executing. The core part of our Kafka infrastructure is hosted in Confluent Cloud. How, then, do we get data from there into our <a href="https://www.elastic.co/observability">full stack observability solution</a>?</p>
<p>Luckily, Confluent has done a fantastic job of making this easy. It provides important Confluent Cloud metrics via an open Prometheus-based metrics URL. So let's get down to business and configure this to bring data into our <a href="https://www.elastic.co/observability">observability tool</a>.</p>
<p>The first step is to configure Confluent Cloud with the MetricsViewer role. The MetricsViewer role provides service account access to the Metrics API for all clusters in an organization. It also enables service accounts to import metrics into third-party metrics platforms.</p>
<p>To assign the MetricsViewer role to a new service account:</p>
<ol>
<li>In the administration menu (☰) in the upper-right corner of the Confluent Cloud user interface, click <strong>ADMINISTRATION &gt; Cloud API keys</strong>.</li>
<li>Click <strong>Add key</strong>.</li>
<li>Click the <strong>Granular access</strong> tile to set the scope for the API key. Click <strong>Next</strong>.</li>
<li>Click <strong>Create a new one</strong> and specify the service account name. Optionally, add a description. Click <strong>Next</strong>.</li>
<li>The API key and secret are generated for the service account. You will need this API key and secret to connect to the cluster, so be sure to safely store this information. Click <strong>Save</strong>. The new service account with the API key and associated ACLs is created. When you return to the API access tab, you can view the newly-created API key to confirm.</li>
<li>Return to Accounts &amp; access in the administration menu, and in the Accounts tab, click <strong>Service accounts</strong> to view your service accounts.</li>
<li>Select the service account that you want to assign the MetricsViewer role to.</li>
<li>In the service account’s details page, click <strong>Access</strong>.</li>
<li>In the tree view, open the resource where you want the service account to have the MetricsViewer role.</li>
<li>Click <strong>Add role assignment</strong> and select the MetricsViewer tile. Click <strong>Save</strong>.</li>
</ol>
<p>Next we can head to <a href="https://www.elastic.co/observability">Elastic Observability</a> and configure the Prometheus integration to pull in the metrics data.</p>
<p>Go to the integrations page in Kibana.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-integrations.png" alt="observability integrations" /></p>
<p>Find the Prometheus integration. We are using the Prometheus integration because the Confluent Cloud metrics server can provide data in Prometheus format. Trust us, this works really well — good work, Confluent!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-integrations-prometheus.png" alt="integrations prometheus" /></p>
<p>Add Prometheus in the next page.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-add-prometheus.png" alt="add prometheus" /></p>
<p>Configure the Prometheus integration as follows: in the hosts box, add the following URL, replacing the resource.kafka.id value with the ID of the cluster you want to monitor.</p>
<pre><code class="language-bash">https://api.telemetry.confluent.cloud:443/v2/metrics/cloud/export?resource.kafka.id=lkc-3rw3gw
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-collect-prometheus-metrics.png" alt="collect prometheus metrics" /></p>
<p>Under the advanced options, add the API key and secret you generated in the Confluent Cloud API keys step above as the username and password.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-http-config-options.png" alt="http config options" /></p>
<p>Once the Integration is created, <a href="https://www.elastic.co/guide/en/fleet/current/agent-policy.html#apply-a-policy">the policy needs to be applied</a> to an instance of a running Elastic Agent.</p>
<p>That’s it! It’s that easy to get all the data you need for a full stack observability monitoring solution.</p>
<p>Finally, let’s pull all this together in a dashboard.</p>
<h2>Pulling it all together</h2>
<p>Using Kibana to generate dashboards is super easy. If you configured everything the way we recommended above, you should find the metrics (producer/consumer/brokers) you need to create your own dashboard as per the following screenshot.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-dashboard-metrics.png" alt="dashboard metrics" /></p>
<p>Luckily, I made a dashboard for you and stored it in <a href="https://github.com/davidgeorgehope/multi-cloud/blob/main/export.ndjson">GitHub</a>. Take a look below, and use the export to import it into your own environment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-producer_metrics.png" alt="producer metrics" /></p>
<h2>Adding the icing on the cake: machine learning anomaly detection</h2>
<p>Now that we have all the critical bits in place, we are going to add the icing on the cake: machine learning (ML)!</p>
<p>Within Kibana, let's head over to the Machine Learning tab in “Analytics.”</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-kibana-analytics.png" alt="kibana analytics" /></p>
<p>Go to the jobs page, where we’ll get started creating our first anomaly detection job.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-create-your-first-anomaly-detection-job.png" alt="create your first anomaly detection job" /></p>
<p>The metrics data view contains what we need to create this new anomaly detection job.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-metrics.png" alt="observability metrics" /></p>
<p>Use the wizard and select a “Single Metric.”</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-use-a-wizard.png" alt="use a wizard" /></p>
<p>Use the full data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-use-full-data.png" alt="use full data" /></p>
<p>In this example, we are going to look for anomalies in the connection count. A major deviation here is something we really do not want, as suddenly having too many or too few clients connecting to our Kafka cluster could indicate something very bad occurring.</p>
<p>Once you have selected the connection count metric, you can proceed through the wizard and eventually your ML job will be created and you should be able to view the data as per the example below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-single-metric-viewer.png" alt="single metric viewer" /></p>
<p>Congratulations, you have now created a machine learning job to alert you if there are any problems with your Kafka cluster, adding <a href="https://www.elastic.co/observability/aiops">a full AIOps solution</a> to your Kafka and Confluent observability!</p>
<h2>Summary</h2>
<p>We looked at monitoring Kafka-based solutions implemented on Confluent Cloud using Elastic Observability.</p>
<p>We covered the architecture of a multi-cloud solution involving AWS EKS, Confluent Cloud, and GCP GKE. We looked at how to instrument Kafka applications with Elastic APM, use JMX for Kafka Producer/Consumer metrics, integrate Prometheus, and set up machine learning anomaly detection.</p>
<p>We went through a detailed walk-through with code snippets, configuration steps, and deployment instructions included to help you get started.</p>
<p>Interested in learning more about Elastic Observability? Check out the following resources:</p>
<ul>
<li><a href="https://www.elastic.co/virtual-events/intro-to-elastic-observability">An Introduction to Elastic Observability</a></li>
<li><a href="https://www.elastic.co/training/observability-fundamentals">Observability Fundamentals Training</a></li>
<li><a href="https://www.elastic.co/observability/demo">Watch an Elastic Observability demo</a></li>
<li><a href="https://www.elastic.co/blog/observability-predictions-trends-2023">Observability Predictions and Trends for 2023</a></li>
</ul>
<p>And sign up for our <a href="https://www.elastic.co/virtual-events/emerging-trends-in-observability">Elastic Observability Trends Webinar</a> featuring AWS and Forrester, not to be missed!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/patterns-white-background-no-logo-observability_(1).png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Ingesting and analyzing Prometheus metrics with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/ingesting-analyzing-prometheus-metrics-observability</link>
            <guid isPermaLink="false">ingesting-analyzing-prometheus-metrics-observability</guid>
            <pubDate>Mon, 09 Oct 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog post, we will showcase the integration of Prometheus with Elastic, emphasizing how Elastic elevates metrics monitoring through extensive historical analytics, anomaly detection, and forecasting, all in a cost-effective manner.]]></description>
<content:encoded><![CDATA[<p>In the world of monitoring and observability, <a href="https://prometheus.io/">Prometheus</a> has grown into the de facto standard for monitoring in cloud-native environments because of its robust data collection mechanism, flexible querying capabilities, and integration with other tools for rich dashboarding and visualization.</p>
<p>Prometheus is primarily built for short-term metric storage, typically retaining data in-memory or on local disk storage, with a focus on real-time monitoring and alerting rather than historical analysis. While it offers valuable insights into current metric values and trends, it may pose economic challenges and fall short of the robust functionalities and capabilities necessary for in-depth historical analysis, long-term trend detection, and forecasting. This is particularly evident in large environments with a substantial number of targets or high data ingestion rates, where metric data accumulates rapidly.</p>
<p>Numerous organizations assess their unique needs and explore avenues to augment their Prometheus monitoring and observability capabilities. One effective approach is integrating Prometheus with Elastic®. In this blog post, we will showcase the integration of Prometheus with Elastic, emphasizing how Elastic elevates metrics monitoring through extensive historical analytics, anomaly detection, and forecasting, all in a cost-effective manner.</p>
<h2>Integrate Prometheus with Elastic seamlessly</h2>
<p>Organizations that have configured their cloud-native applications to expose metrics in Prometheus format can seamlessly transmit the metrics to Elastic by using <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-prometheus.html">Prometheus integration</a>. Elastic enables organizations to monitor their metrics in conjunction with all other data gathered through <a href="https://www.elastic.co/integrations/data-integrations">Elastic's extensive integrations</a>.</p>
<p>Go to Integrations and find the Prometheus integration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-1-integrations.png" alt="1 - integrations" /></p>
<p>To gather metrics from Prometheus servers, the Elastic Agent is employed, with central management of Elastic agents handled through the <a href="https://www.elastic.co/guide/en/fleet/current/fleet-overview.html">Fleet server</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-2-set-up-prometheus-integration.png" alt="2 - set up integration" /></p>
<p>After enrolling the Elastic Agent in Fleet, users can choose from the following methods to ingest Prometheus metrics into Elastic.</p>
<h3>1. Prometheus collectors</h3>
<p><a href="https://docs.elastic.co/integrations/prometheus#prometheus-exporters-collectors">The Prometheus collectors</a> connect to the Prometheus server and pull metrics or scrape metrics from a Prometheus exporter.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-3-prometheus-collectors.png" alt="3 - Prometheus collectors" /></p>
<h3>2. Prometheus queries</h3>
<p><a href="https://docs.elastic.co/integrations/prometheus#prometheus-queries-promql">The Prometheus queries</a> execute specific Prometheus queries against <a href="https://prometheus.io/docs/prometheus/latest/querying/api/#expression-queries">Prometheus Query API</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-4-promtheus-queries.png" alt="4 - Prometheus queries" /></p>
<h3>3. Prometheus remote-write</h3>
<p><a href="https://docs.elastic.co/integrations/prometheus#prometheus-server-remote-write">The Prometheus remote_write</a> can receive metrics from a Prometheus server that has configured the <a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write">remote_write</a> setting.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-5-prometheus-remote-write.png" alt="5 - Prometheus remote-write" /></p>
<p>After your Prometheus metrics are ingested, you have the option to visualize your data graphically within the <a href="https://www.elastic.co/guide/en/observability/current/explore-metrics.html">Metrics Explorer</a> and further segment it based on labels, such as hosts, containers, and more.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-10-metrics-explorer.png" alt="10 - metrics explorer" /></p>
<p>You can also query your metrics data in <a href="https://www.elastic.co/guide/en/kibana/current/discover.html">Discover</a> and explore the fields of your individual documents within the details panel.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-7-expanded-doc.png" alt="7 - expanded document" /></p>
<h2>Storing historical metrics with Elastic’s data tiering mechanism</h2>
<p>By exporting Prometheus metrics to Elasticsearch, organizations can extend the retention period and gain the ability to analyze metrics historically. Elastic optimizes data storage and access based on the frequency of data usage and the performance requirements of different data sets. The goal is to efficiently manage and store data, ensuring that it remains accessible when needed while keeping storage costs in check.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-8-hot-to-frozen.png" alt="8 - hot to frozen flow chart" /></p>
<p>After ingesting Prometheus metrics data, you have various retention options. You can set the duration for data to reside in the hot tier, which utilizes high IO hardware (SSD) and is more expensive. Alternatively, you can move the Prometheus metrics to the warm tier, employing cost-effective hardware like spinning disks (HDD) while maintaining consistent and efficient search performance. The cold tier mirrors the infrastructure of the warm tier for primary data but utilizes S3 for replica storage. Elastic automatically recovers replica indices from S3 in case of node or disk failure, ensuring search performance comparable to the warm tier while reducing disk cost.</p>
<p>The <a href="https://www.elastic.co/blog/introducing-elasticsearch-frozen-tier-searchbox-on-s3">frozen tier</a> allows direct searching of data stored in S3 or an object store, without the need for rehydration. The purpose is to further reduce storage costs for Prometheus metrics data that is less frequently accessed. By moving historical data into the frozen tier, organizations can optimize their storage infrastructure, ensuring that the recent, critical data remains in higher-performance tiers while less frequently accessed data is stored economically in the frozen tier. This way, organizations can perform historical analysis and trend detection, identify patterns and make informed decisions, and maintain compliance with regulatory standards in a cost-effective manner.</p>
<p>An alternative way to store your cloud-native metrics more efficiently is to use <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html">Elastic Time Series Data Stream</a> (TSDS). TSDS can store your metrics data more efficiently with <a href="https://www.elastic.co/blog/70-percent-storage-savings-for-metrics-with-elastic-observability">~70% less disk space</a> than a regular data stream. The <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/downsampling.html">downsampling</a> functionality will further reduce the storage required by rolling up metrics within a fixed time interval into a single summary metric. This not only assists organizations in cutting down on storage expenses for metric data but also simplifies the metric infrastructure, making it easier for users to correlate metrics with logs and traces through a unified interface.</p>
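<p>Conceptually, downsampling replaces every raw metric point inside a fixed interval with one summary document. The following toy sketch (not Elastic's implementation, just the idea) rolls raw points into per-minute summaries carrying min, max, sum, and count:</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch of downsampling: all raw points inside a fixed time
// interval are rolled up into a single summary (min, max, sum, count).
public class Downsample {
    record Point(long epochMillis, double value) {}
    record Summary(long bucketStart, double min, double max, double sum, long count) {
        double avg() { return sum / count; }
    }

    static List<Summary> downsample(List<Point> points, long intervalMillis) {
        Map<Long, Summary> buckets = new TreeMap<>();
        for (Point p : points) {
            long start = (p.epochMillis() / intervalMillis) * intervalMillis;
            buckets.merge(start,
                    new Summary(start, p.value(), p.value(), p.value(), 1),
                    (a, b) -> new Summary(start,
                            Math.min(a.min(), b.min()),
                            Math.max(a.max(), b.max()),
                            a.sum() + b.sum(),
                            a.count() + b.count()));
        }
        return new ArrayList<>(buckets.values());
    }

    public static void main(String[] args) {
        List<Point> raw = List.of(
                new Point(0, 10), new Point(30_000, 20), // first 1-minute bucket
                new Point(70_000, 40));                  // second 1-minute bucket
        List<Summary> out = downsample(raw, 60_000);
        System.out.println(out.size());       // 2
        System.out.println(out.get(0).avg()); // 15.0
    }
}
```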
<h2>Advanced analytics</h2>
<p>Besides <a href="https://www.elastic.co/guide/en/observability/current/explore-metrics.html">Metrics Explorer</a> and <a href="https://www.elastic.co/guide/en/kibana/current/discover.html">Discover</a>, Elasticsearch® provides more advanced analytics capabilities and empowers organizations to gain deeper, more valuable insights into their Prometheus metrics data.</p>
<p>Out of the box, Prometheus integration provides a default overview dashboard.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-9-advacned-analytics.png" alt="9 - adv analytics" /></p>
<p>From Metrics Explorer or Discover, users can also easily edit their Prometheus metrics visualization in <a href="https://www.elastic.co/kibana/kibana-lens">Elastic Lens</a> or create new visualizations from Lens.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-6-metrics-explorer.png" alt="6 - metrics explorer" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-11-green-bars.png" alt="11 - green bars" /></p>
<p>Elastic Lens enables users to explore and visualize data intuitively through dynamic visualizations. This user-friendly interface eliminates the need for complex query languages, making data analysis accessible to a broader audience. Elasticsearch also offers other powerful visualization methods with <a href="https://www.elastic.co/guide/en/kibana/current/add-aggregation-based-visualization-panels.html">aggregations</a> and <a href="https://www.youtube.com/watch?v=I8NtctS33F0">filters</a>, enabling users to perform advanced analytics on their Prometheus metrics data, including short-term and historical data. To learn more, check out the <a href="https://www.elastic.co/videos/training-how-to-series-stack">how-to series: Kibana</a>.</p>
<h2>Anomaly detection and forecasting</h2>
<p>When analyzing data, maintaining a constant watch on the screen is simply not feasible, especially when dealing with millions of time series of Prometheus metrics. Engineers frequently encounter the challenge of differentiating normal from abnormal data points, which involves analyzing historical data patterns — a process that can be exceedingly time-consuming and often exceeds human capabilities. Thus, there is a pressing need for a more intelligent approach to detect anomalies efficiently.</p>
<p>Setting up alerts may seem like an obvious solution, but relying solely on rule-based alerts with static thresholds can be problematic. What's normal on a Wednesday at 9:00 a.m. might be entirely different from a Sunday at 2:00 a.m. This often leads to complex and hard-to-maintain rules or wide alert ranges that end up missing crucial issues. Moreover, as your business, infrastructure, users, and products evolve, these fixed rules don't keep up, resulting in lots of false positives or, even worse, important issues slipping through the cracks without detection. A more intelligent and adaptable approach is needed to ensure accurate and timely anomaly detection.</p>
<p>Elastic's machine learning anomaly detection excels in such scenarios. It automatically models the normal behavior of your Prometheus data, learning trends, and identifying anomalies, thereby reducing false positives and improving mean time to resolution (MTTR). With over 13 years of development experience in this field, Elastic has emerged as a trusted industry leader.</p>
<p>The key advantage of Elastic's machine learning anomaly detection lies in its unsupervised learning approach. By continuously observing real-time data, it acquires an understanding of the data's behavior over time. This includes grasping daily and weekly patterns, enabling it to establish a normalcy range of expected behavior. Behind the scenes, it constructs statistical models that allow accurate predictions, promptly identifying any unexpected variations. In cases where emerging data exhibits unusual trends, you can seamlessly integrate with alerting systems, operationalizing this valuable insight.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-12-LPO.png" alt="12 - LPO" /></p>
<p>Machine learning's ability to project into the future, forecasting data trends one day, a week, or even a month ahead, equips engineers not only with reporting capabilities but also with pattern recognition and failure prediction based on historical Prometheus data. This plays a crucial role in maintaining mission-critical workloads, offering organizations a proactive monitoring approach. By foreseeing and addressing issues before they escalate, organizations can avert downtime, cut costs, optimize resource utilization, and ensure uninterrupted availability of their vital applications and services.</p>
<p><a href="https://www.elastic.co/guide/en/machine-learning/current/ml-ad-run-jobs.html#ml-ad-create-job">Creating a machine learning job</a> for your Prometheus data is a straightforward task with a few simple steps. Simply specify the data index and set the desired time range in the single metric view. The machine learning job will then automatically process the historical data, building statistical models behind the scenes. These models will enable the system to predict trends and identify anomalies effectively, providing valuable and actionable insights for your monitoring needs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-13-creating-ML-job.png" alt="13 - create ML job" /></p>
<p>In essence, Elastic machine learning empowers us to harness the capabilities of data scientists and effectively apply them in monitoring Prometheus metrics. By seamlessly detecting anomalies and predicting potential issues in advance, Elastic machine learning bridges the gap and enables IT professionals to benefit from the insights derived from advanced data analysis. This practical and accessible approach to anomaly detection equips organizations with a proactive stance toward maintaining the reliability of their systems.</p>
<h2>Try it out</h2>
<p><a href="https://www.elastic.co/cloud/cloud-trial-overview">Start a free trial</a> on Elastic Cloud and <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-prometheus.html">ingest your Prometheus metrics into Elastic</a>. Enhance your Prometheus monitoring with Elastic Observability. Stay ahead of potential issues with advanced AI/ML anomaly detection and prediction capabilities. Eliminate data silos, reduce costs, and enhance overall response efficiency.</p>
<p>Elevate your monitoring capabilities with Elastic today!</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/illustration-machine-learning-anomaly-v2.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Dynamic workload discovery on Kubernetes now supported with EDOT Collector]]></title>
            <link>https://www.elastic.co/observability-labs/blog/k8s-discovery-with-EDOT-collector</link>
            <guid isPermaLink="false">k8s-discovery-with-EDOT-collector</guid>
            <pubDate>Tue, 01 Apr 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover how Elastic's OpenTelemetry Collector leverages Kubernetes pod annotations providing dynamic workload discovery and improves automated metric and log collection for Kubernetes clusters.]]></description>
            <content:encoded><![CDATA[<p>At Elastic, Kubernetes is one of the most significant observability use cases we focus on.
We want to provide the best onboarding experience and lifecycle management based on real-world GitOps best practices.</p>
<p>OpenTelemetry recently <a href="https://opentelemetry.io/blog/2025/otel-collector-k8s-discovery/">published a blog</a> on how to do <code>Autodiscovery based on Kubernetes Pods' annotations</code> with the OpenTelemetry Collector.</p>
<p>In this blog post, we will talk about how to use this Kubernetes-related feature of the OpenTelemetry Collector,
which is already available with the Elastic Distribution of OpenTelemetry (EDOT) Collector.</p>
<p>In addition to this feature, at Elastic we invest heavily in making OpenTelemetry the best standardized ingest solution for observability.
You might already have seen us focusing on:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">Semantic Conventions standardization</a></p>
</li>
<li>
<p>significant <a href="https://www.elastic.co/observability-labs/blog/elastics-collaboration-opentelemetry-filelog-receiver">log collection improvements</a></p>
</li>
<li>
<p>various other topics around <a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-go-applications-opentelemetry">instrumentation</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">profiling</a></p>
</li>
</ul>
<p>Let's walk through a hands-on journey using the EDOT Collector, covering various use cases you might encounter in the real world and highlighting the capabilities of this powerful feature.</p>
<h2>Configuring EDOT Collector</h2>
<p>The Collector’s configuration is not our main focus here: by the nature of this feature, it stays minimal,
letting the workloads themselves define how they should be monitored.</p>
<p>To illustrate the point, here is the Collector configuration snippet that enables the feature for both logs and metrics:</p>
<pre><code class="language-yaml">receivers:
    receiver_creator/metrics:
      watch_observers: [k8s_observer]
      discovery:
        enabled: true
      receivers:

    receiver_creator/logs:
      watch_observers: [k8s_observer]
      discovery:
        enabled: true
      receivers:
</code></pre>
<p>You can include the above in the EDOT’s Collector configuration, specifically the
<a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L339">receivers’ section</a>.</p>
<p>Since log collection in our examples will happen through the discovery feature, make sure that the static filelog receiver
<a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L348">configuration block</a> is removed
and its <a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L193"><code>preset</code></a>
is disabled (i.e., set to <code>false</code>) to avoid log duplication.</p>
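<p>As a sketch, disabling the preset in the chart's <code>values.yaml</code> might look like the following. Note that the exact key path depends on the chart version, so treat the key names here as an assumption and verify them against your own <code>values.yaml</code>:</p>
<pre><code class="language-yaml"># values.yaml (assumed layout; verify against your chart version)
collectors:
  daemon:
    presets:
      logsCollection:
        enabled: false  # prevent collecting the same logs twice alongside discovery
</code></pre>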
<p>Make sure that the receiver creator is properly added to both the
<a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L471">logs</a> pipeline
(in addition to removing the <code>filelog</code> receiver completely)
and the <a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L484">metrics</a> pipeline.</p>
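<p>Put together, the service pipelines might be wired roughly as follows. The processor and exporter names below are placeholders standing in for whatever your <code>values.yaml</code> already defines, not a prescription:</p>
<pre><code class="language-yaml">service:
  pipelines:
    logs:
      receivers: [receiver_creator/logs]   # filelog removed entirely
      processors: [batch]
      exporters: [elasticsearch]
    metrics:
      receivers: [receiver_creator/metrics]
      processors: [batch]
      exporters: [elasticsearch]
</code></pre>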
<p>Ensure that <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.122.0/extension/observer/k8sobserver/README.md"><code>k8sobserver</code></a>
is enabled as part of the extensions:</p>
<pre><code class="language-yaml">extensions:
  k8s_observer:
    observe_nodes: true
    observe_services: true
    observe_ingresses: true

# ...

service:
  extensions: [k8s_observer]
</code></pre>
<p>Last but not least, ensure the log files' volume is mounted properly:</p>
<pre><code class="language-yaml">volumeMounts:
 - name: varlogpods
   mountPath: /var/log/pods
   readOnly: true

volumes:
  - name: varlogpods
    hostPath:
      path: /var/log/pods
</code></pre>
<p>Once the configuration is ready, follow the <a href="https://www.elastic.co/docs/reference/opentelemetry/quickstart/">Kubernetes quickstart guides on how to deploy the EDOT Collector</a>.
Make sure to replace the <code>values.yaml</code> file linked in the quickstart guide with the file that includes the above-described modifications.</p>
<h3>Collecting Metrics from Moving Targets Based on Their Annotations</h3>
<p>In this example, we have a Deployment with a Pod spec that consists of two different containers.
One container runs a Redis server, while the other runs an NGINX server. Consequently, we want to provide
different hints for each of these target containers.</p>
<p>The annotation-based discovery feature supports this, allowing us to specify metrics annotations
per exposed container port.</p>
<p>Here is how the complete spec file looks:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-conf
data:
  nginx.conf: |
    user  nginx;
    worker_processes  1;
    error_log  /dev/stderr warn;
    pid        /var/run/nginx.pid;
    events {
      worker_connections  1024;
    }
    http {
      include       /etc/nginx/mime.types;
      default_type  application/octet-stream;

      log_format  main  '$remote_addr - $remote_user [$time_local] &quot;$request&quot; '
                        '$status $body_bytes_sent &quot;$http_referer&quot; '
                        '&quot;$http_user_agent&quot; &quot;$http_x_forwarded_for&quot;';
      access_log  /dev/stdout main;
      server {
          listen 80;
          server_name localhost;

          location /nginx_status {
              stub_status on;
          }
      }
      include /etc/nginx/conf.d/*;
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-deployment
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        # redis container port hints
        io.opentelemetry.discovery.metrics.6379/enabled: &quot;true&quot;
        io.opentelemetry.discovery.metrics.6379/scraper: redis
        io.opentelemetry.discovery.metrics.6379/config: |
          collection_interval: &quot;20s&quot;
          timeout: &quot;10s&quot;

        # nginx container port hints
        io.opentelemetry.discovery.metrics.80/enabled: &quot;true&quot;
        io.opentelemetry.discovery.metrics.80/scraper: nginx
        io.opentelemetry.discovery.metrics.80/config: |
          endpoint: &quot;http://`endpoint`/nginx_status&quot;
          collection_interval: &quot;30s&quot;
          timeout: &quot;20s&quot;
    spec:
      volumes:
      - name: nginx-conf
        configMap:
          name: nginx-conf
          items:
            - key: nginx.conf
              path: nginx.conf
      containers:
        - name: webserver
          image: nginx:latest
          ports:
            - containerPort: 80
              name: webserver
          volumeMounts:
            - mountPath: /etc/nginx/nginx.conf
              readOnly: true
              subPath: nginx.conf
              name: nginx-conf
        - image: redis
          imagePullPolicy: IfNotPresent
          name: redis
          ports:
            - name: redis
              containerPort: 6379
              protocol: TCP
</code></pre>
<p>When this workload is deployed, the Collector will automatically discover it and identify the specific annotations.
After this, two different receivers will be started, each responsible for one of the target containers.</p>
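<p>For illustration only, these two dynamically started receivers behave roughly as if the following static configuration had been written by hand, with <code>&lt;pod-ip&gt;</code> filled in by the observer at runtime (a sketch, not literal generated output):</p>
<pre><code class="language-yaml">receivers:
  redis:
    endpoint: &quot;&lt;pod-ip&gt;:6379&quot;
    collection_interval: 20s
    timeout: 10s
  nginx:
    endpoint: &quot;http://&lt;pod-ip&gt;:80/nginx_status&quot;
    collection_interval: 30s
    timeout: 20s
</code></pre>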
<h3>Collecting Logs from Multiple Target Containers</h3>
<p>The annotation-based discovery feature also supports log collection based on the provided annotations.
In the example below, we again have a Deployment with a Pod consisting of two different containers,
where we want to apply different log collection configurations.
We can specify annotations that are scoped to individual container names:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-logs-deployment
  labels:
    app: busybox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
      annotations:
        io.opentelemetry.discovery.logs.lazybox/enabled: &quot;true&quot;
        io.opentelemetry.discovery.logs.lazybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints-lazybox
        io.opentelemetry.discovery.logs.busybox/enabled: &quot;true&quot;
        io.opentelemetry.discovery.logs.busybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints-busybox
    spec:
      containers:
        - name: busybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo &quot;otel logs from busybox at $(date +%H:%M:%S)&quot; &amp;&amp; sleep 5s; done
        - name: lazybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo &quot;otel logs from lazybox at $(date +%H:%M:%S)&quot; &amp;&amp; sleep 25s; done
</code></pre>
<p>The above configuration enables two different filelog receiver instances, each applying a unique parsing configuration.
This is handy when we know how to parse the logs of a specific technology, such as Apache server access logs.</p>
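<p>For instance, a container named <code>apache</code> (a hypothetical name for this sketch) could have its access logs parsed with the filelog receiver's <code>regex_parser</code> operator. The pattern below is a simplified sketch of the combined log format, not a production-ready one:</p>
<pre><code class="language-yaml">annotations:
  io.opentelemetry.discovery.logs.apache/enabled: &quot;true&quot;
  io.opentelemetry.discovery.logs.apache/config: |
    operators:
      - id: container-parser
        type: container
      # simplified combined-log pattern; extend for your real format
      - type: regex_parser
        regex: '^(?P&lt;client&gt;\S+) \S+ \S+ \[(?P&lt;timestamp&gt;[^\]]+)\] &quot;(?P&lt;request&gt;[^&quot;]*)&quot; (?P&lt;status&gt;\d+)'
</code></pre>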
<h3>Combining Both Metrics and Logs Collection</h3>
<p>In our third example, we illustrate how to define both metrics and log annotations on the same workload.
This allows us to collect both signals from the discovered workload.
Below is a Deployment with a Pod consisting of a Redis server and a BusyBox container that performs dummy log writing.
We can target annotations to the port and container levels to collect metrics from the Redis server using
the Redis receiver, and logs from the BusyBox using the filelog receiver. Here’s how:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-deployment
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        io.opentelemetry.discovery.metrics.6379/enabled: &quot;true&quot;
        io.opentelemetry.discovery.metrics.6379/scraper: redis
        io.opentelemetry.discovery.metrics.6379/config: |
          collection_interval: &quot;20s&quot;
          timeout: &quot;10s&quot;

        io.opentelemetry.discovery.logs.busybox/enabled: &quot;true&quot;
        io.opentelemetry.discovery.logs.busybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints
    spec:
      containers:
        - image: redis
          imagePullPolicy: IfNotPresent
          name: redis
          ports:
            - name: redis
              containerPort: 6379
              protocol: TCP
        - name: busybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo &quot;otel logs at $(date +%H:%M:%S)&quot; &amp;&amp; sleep 15s; done
</code></pre>
<h3>Exploring and Analyzing Data from Dynamic Targets in Elastic</h3>
<p>Once the target Pods are discovered and the Collector has started collecting telemetry data from them,
we can then explore this data in Elastic. In Discover we can search for Redis and NGINX metrics as well as
logs collected from the BusyBox container. Here is what it looks like:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/k8s-discovery-with-EDOT-collector/discoverlogs.png" alt="Logs Discovery" />
<img src="https://www.elastic.co/observability-labs/assets/images/k8s-discovery-with-EDOT-collector/discovermetrics.png" alt="Metrics Discovery" /></p>
<h2>Summary</h2>
<p>The examples above showcase how users of our OpenTelemetry Collector can take advantage of this new feature
— one we played a major role in developing.</p>
<p>For this, we leveraged our years of experience with similar features already supported in
<a href="https://www.elastic.co/guide/en/beats/metricbeat/current/configuration-autodiscover-hints.html">Metricbeat</a>,
<a href="https://www.elastic.co/guide/en/beats/filebeat/current/configuration-autodiscover-hints.html">Filebeat</a>, and
<a href="https://www.elastic.co/guide/en/fleet/current/hints-annotations-autodiscovery.html">Elastic-Agent</a>.
This makes us extremely happy and confident, as it closes the feature gap between Elastic's specific
monitoring agents and the OpenTelemetry Collector — making it even better.</p>
<p>Interested in learning more? Visit the
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/receivercreator/README.md#generate-receiver-configurations-from-provided-hints">documentation</a>
and give it a try by following our <a href="https://www.elastic.co/docs/reference/opentelemetry/quickstart/">EDOT quickstart guide</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/k8s-discovery-with-EDOT-collector/k8s-discovery-new.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Managing your Kubernetes cluster with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kubernetes-cluster-metrics-logs-monitoring</link>
            <guid isPermaLink="false">kubernetes-cluster-metrics-logs-monitoring</guid>
            <pubDate>Mon, 24 Oct 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[Unify all of your Kubernetes metrics, log, and trace data on a single platform and dashboard, Elastic. From the infrastructure to the application layer Elastic Observability makes it easier for you to understand how your cluster is performing.]]></description>
            <content:encoded><![CDATA[<p>As an operations engineer (SRE, IT manager, DevOps), you’re always struggling with how to manage technology and data sprawl. Kubernetes is becoming increasingly pervasive and a majority of these deployments will be in Amazon Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), or Azure Kubernetes Service (AKS). Some of you may be on a single cloud while others will have the added burden of managing clusters on multiple Kubernetes cloud services. In addition to cloud provider complexity, you also have to manage hundreds of deployed services generating more and more observability and telemetry data.</p>
<p>The day-to-day operations of understanding the status and health of your Kubernetes clusters and applications running on them, through the logs, metrics, and traces they generate, will likely be your biggest challenge. But as an operations engineer you will need all of that important data to help prevent, predict, and remediate issues. And you certainly don’t need that volume of metrics, logs and traces spread across multiple tools when you need to visualize and analyze Kubernetes telemetry data for troubleshooting and support.</p>
<p>Elastic Observability helps manage the sprawl of Kubernetes metrics and logs by providing extensive and centralized observability capabilities beyond just the logging that we are known for. Elastic Observability provides you with granular insights and context into the behavior of your Kubernetes clusters along with the applications running on them by unifying all of your metrics, log, and trace data through OpenTelemetry and APM agents.</p>
<p>Regardless of the cluster location (EKS, GKE, AKS, self-managed) or application, <a href="https://www.elastic.co/what-is/kubernetes-monitoring">Kubernetes monitoring</a> is made simple with Elastic Observability. All of the node, pod, container, application, and infrastructure (AWS, GCP, Azure) metrics, infrastructure and application logs, along with application traces are available in Elastic Observability.</p>
<p>In this blog we will show:</p>
<ul>
<li>How <a href="http://cloud.elastic.co">Elastic Cloud</a> can aggregate and ingest metrics and log data through the Elastic Agent (easily deployed on your cluster as a DaemonSet) to retrieve logs and metrics from the host (system metrics, container stats) along with logs from all services running on top of Kubernetes.</li>
<li>How Elastic Observability can bring a unified telemetry experience (logs, metrics, traces) across all your Kubernetes cluster components (pods, nodes, services, namespaces, and more).</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-ElasticAgentIntegration-1.png" alt="Elastic Agent with Kubernetes Integration" /></p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</li>
<li>While we used GKE, you can use any location for your Kubernetes cluster.</li>
<li>We used a variant of the ever so popular <a href="https://github.com/GoogleCloudPlatform/microservices-demo">HipsterShop</a> demo application. It was originally written by Google to showcase Kubernetes across a multitude of variants available such as the <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry Demo App</a>. To use the app, please go <a href="https://github.com/bshetti/opentelemetry-microservices-demo/tree/main/deploy-with-collector-k8s">here</a> and follow the instructions to deploy. You don’t need to deploy otelcollector for Kubernetes metrics to flow — we will cover this below.</li>
<li>Elastic supports native ingest from Prometheus and Fluentd, but in this blog, we are showing direct ingest from the Kubernetes cluster via Elastic Agent. There will be a follow-up blog showing how Elastic can also pull in telemetry from Prometheus or Fluentd/Fluent Bit.</li>
</ul>
<h2>What can you observe and analyze with Elastic?</h2>
<p>Before we walk through the steps on getting Elastic set up to ingest and visualize Kubernetes cluster metrics and logs, let’s take a sneak peek at Elastic’s helpful dashboards.</p>
<p>As we noted, we ran a variant of HipsterShop on GKE and deployed Elastic Agents with Kubernetes integration as a DaemonSet on the GKE cluster. Upon deployment of the agents, Elastic starts ingesting metrics from the Kubernetes cluster (specifically from kube-state-metrics) and additionally Elastic will pull all log information from the cluster.</p>
<h3>Visualizing Kubernetes metrics on Elastic Observability</h3>
<p>Here are a few Kubernetes dashboards that will be available out of the box (OOTB) on Elastic Observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-HipsterShopMetrics-2.png" alt="HipsterShop cluster metrics on Elastic Kubernetes overview dashboard " /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-HipsterShopDashboard-3.png" alt="HipsterShop default namespace pod dashboard on Elastic Observability" /></p>
<p>In addition to the cluster overview dashboard and pod dashboard, Elastic has several useful OOTB dashboards:</p>
<ul>
<li>Kubernetes overview dashboard (see above)</li>
<li>Kubernetes pod dashboard (see above)</li>
<li>Kubernetes nodes dashboard</li>
<li>Kubernetes deployments dashboard</li>
<li>Kubernetes DaemonSets dashboard</li>
<li>Kubernetes StatefulSets dashboards</li>
<li>Kubernetes CronJob &amp; Jobs dashboards</li>
<li>Kubernetes services dashboards</li>
<li>More being added regularly</li>
</ul>
<p>Additionally, you can either customize these dashboards or build out your own.</p>
<h3>Working with logs on Elastic Observability</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-Logging-4.png" alt="Kubernetes container logs and Elastic Agent logs" /></p>
<p>As you can see from the screens above, not only can I get Kubernetes cluster metrics, but also all the Kubernetes logs simply by using the Elastic Agent in my Kubernetes cluster.</p>
<h3>Prevent, predict, and remediate issues</h3>
<p>In addition to helping manage metrics and logs, Elastic can help you detect and predict anomalies across your cluster telemetry. Simply turn on Machine Learning in Elastic against your data and watch it help you enhance your analysis work. As you can see below, Elastic is not only a unified observability location for your Kubernetes cluster logs and metrics, but it also provides extensive machine learning capabilities to enhance your analysis and management.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-AnomalyDetection-5.png" alt="Anomaly detection across logs on Elastic Observability" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-PodIssues-6.png" alt="Analyzing issues on a Kubernetes pod with Elastic Observability " /></p>
<p>In the top graph, you see anomaly detection across logs, which shows something potentially wrong in the September 21 to 23 time period. The bottom chart digs into the details, analyzing the single kubernetes.pod.cpu.usage.node metric, which shows CPU issues early in September and again later in the month. You can do more complicated analyses on your cluster telemetry with Machine Learning using multi-metric analysis (versus the single-metric analysis I am showing above) along with population analysis.</p>
<p>Elastic gives you better machine learning capabilities to enhance your analysis of Kubernetes cluster telemetry. In the next section, let’s walk through how easy it is to get your telemetry data into Elastic.</p>
<h2>Setting it all up</h2>
<p>Let’s walk through the details of how to get metrics, logs, and traces into Elastic from a HipsterShop application deployed on GKE.</p>
<p>First, pick your favorite version of HipsterShop — as we noted above, we used a variant of the <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry-Demo</a> because it already has OpenTelemetry instrumentation built in. We slimmed it down for this blog, however, to fewer services across a few different languages.</p>
<h3>Step 0: Get an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-FreeElasticCloud-7.png" alt="" /></p>
<h3>Step 1: Get a Kubernetes cluster and load your Kubernetes app into your cluster</h3>
<p>Get your app on a Kubernetes cluster in your Cloud service of choice or local Kubernetes platform. Once your app is up on Kubernetes, you should have the following pods (or some variant) running on the default namespace.</p>
<pre><code class="language-yaml">NAME                                    READY   STATUS    RESTARTS   AGE
adservice-8694798b7b-jbfxt              1/1     Running   0          4d3h
cartservice-67b598697c-hfsxv            1/1     Running   0          4d3h
checkoutservice-994ddc4c4-p9p2s         1/1     Running   0          4d3h
currencyservice-574f65d7f8-zc4bn        1/1     Running   0          4d3h
emailservice-6db78645b5-ppmdk           1/1     Running   0          4d3h
frontend-5778bfc56d-jjfxg               1/1     Running   0          4d3h
jaeger-686c775fbd-7d45d                 1/1     Running   0          4d3h
loadgenerator-c8f76d8db-gvrp7           1/1     Running   0          4d3h
otelcollector-5b87f4f484-4wbwn          1/1     Running   0          4d3h
paymentservice-6888bb469c-nblqj         1/1     Running   0          4d3h
productcatalogservice-66478c4b4-ff5qm   1/1     Running   0          4d3h
recommendationservice-648978746-8bzxc   1/1     Running   0          4d3h
redis-cart-96d48485f-gpgxd              1/1     Running   0          4d3h
shippingservice-67fddb767f-cq97d        1/1     Running   0          4d3h
</code></pre>
<h3>Step 2: Turn on kube-state-metrics</h3>
<p>Next you will need to turn on <a href="https://github.com/kubernetes/kube-state-metrics">kube-state-metrics</a>.</p>
<p>First:</p>
<pre><code class="language-bash">git clone https://github.com/kubernetes/kube-state-metrics.git
</code></pre>
<p>Next, from the examples directory inside the kube-state-metrics directory, just apply the standard config.</p>
<pre><code class="language-bash">kubectl apply -f ./standard
</code></pre>
<p>This will turn on kube-state-metrics, and you should see a pod similar to this running in kube-system namespace.</p>
<pre><code class="language-yaml">kube-state-metrics-5f9dc77c66-qjprz                    1/1     Running   0          4d4h
</code></pre>
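<p>To verify that kube-state-metrics is actually serving data before wiring up the agent, you can port-forward the service and pull a few metrics. This sketch assumes the default service name and port (kube-state-metrics on 8080) from the standard manifests; adjust if you customized them.</p>
<pre><code class="language-bash"># Forward the kube-state-metrics service to localhost (default port 8080)
kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080 &amp;

# Fetch the first few Prometheus-format metrics to confirm the endpoint is live
curl -s http://localhost:8080/metrics | head -n 5
</code></pre>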
<h3>Step 3: Install the Elastic Agent with Kubernetes integration</h3>
<p><strong>Add Kubernetes Integration:</strong></p>
<ol>
<li>In Elastic, go to Integrations, select the Kubernetes integration, and click Add Kubernetes.</li>
<li>Give the Kubernetes integration a name.</li>
<li>Turn on kube-state-metrics in the configuration screen.</li>
<li>Give the configuration a name in the new-agent-policy-name text box.</li>
<li>Save the configuration. The integration, along with its policy, is now created.</li>
</ol>
<p><img src="https://images.contentstack.io/v3/assets/bltefdd0b53724fa2ce/blt5a3ae745e98b9e37/635691670a58db35cbdbc0f6/ManagingKubernetes-Addk8sButton-8.png" alt="" /></p>
<p>You can read up on agent policies and how the Elastic Agent uses them <a href="https://www.elastic.co/guide/en/fleet/current/agent-policy.html">here</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-K8sIntegration-9.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-FleetManagement-10.png" alt="" /></p>
<ol>
<li>In the Add Agent flow, add the Kubernetes integration.</li>
<li>In the second step, select the policy you just created.</li>
<li>In the third step of the Add Agent instructions, copy and paste or download the manifest.</li>
<li>In the shell where you have kubectl running, save the manifest as elastic-agent-managed-kubernetes.yaml and run the following command.</li>
</ol>
<pre><code class="language-bash">kubectl apply -f elastic-agent-managed-kubernetes.yaml
</code></pre>
<p>You should see a number of agents come up as part of a DaemonSet in the kube-system namespace.</p>
<pre><code class="language-yaml">NAME                                                   READY   STATUS    RESTARTS   AGE
elastic-agent-qr6hj                                    1/1     Running   0          4d7h
elastic-agent-sctmz                                    1/1     Running   0          4d7h
elastic-agent-x6zkw                                    1/1     Running   0          4d7h
elastic-agent-zc64h                                    1/1     Running   0          4d7h
</code></pre>
<p>In my cluster, I have four nodes and four elastic-agents started as part of the DaemonSet.</p>
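<p>As a quick sanity check (assuming the manifest's default DaemonSet name, elastic-agent), you can confirm that one agent pod is scheduled per node and that the rollout has finished:</p>
<pre><code class="language-bash"># DESIRED, CURRENT, and READY should all equal your node count
kubectl -n kube-system get daemonset elastic-agent

# Block until every agent pod is up, or time out after two minutes
kubectl -n kube-system rollout status daemonset/elastic-agent --timeout=120s
</code></pre>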
<h3>Step 4: Look at Elastic's out-of-the-box (OOTB) dashboards for Kubernetes metrics and start discovering Kubernetes logs</h3>
<p>That is it. You should see metrics flowing into all the dashboards. To view logs for specific pods, simply go into Discover in Kibana and search for a specific pod name.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-HipsterShopMetrics-2.png" alt="HipsterShop cluster metrics on Elastic Kubernetes overview dashboard" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-HipsterShopDashboard-3.png" alt="Hipstershop default namespace pod dashboard on Elastic Observability" /></p>
<p>Additionally, you can browse all the pod logs directly in Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKurbenetes-PodLogs-11.png" alt="frontendService and cartService logs" /></p>
<p>In the above example, I searched for frontendService and cartService logs.</p>
<h3>Step 5: Bonus!</h3>
<p>Because we are using an OTel-based application, Elastic can even pull in the application traces. But that is a discussion for another blog.</p>
<p>Here is a quick peek at what Hipster Shop’s traces for a front end transaction look like in Elastic Observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-CheckOutTransaction-12.png" alt="Trace for Checkout transaction for HipsterShop" /></p>
<h2>Conclusion: Elastic Observability rocks for Kubernetes monitoring</h2>
<p>I hope you’ve gotten an appreciation for how Elastic Observability can help you manage Kubernetes clusters along with the complexity of the metrics, log, and trace data it generates for even a simple deployment.</p>
<p>A quick recap of lessons learned:</p>
<ul>
<li>How <a href="http://cloud.elastic.co">Elastic Cloud</a> can aggregate and ingest telemetry data through the Elastic Agent, which is easily deployed on your cluster as a DaemonSet and retrieves metrics from the host, such as system metrics, container stats, and metrics from all services running on top of Kubernetes.</li>
<li>What Elastic brings to a unified telemetry experience (Kubernetes logs, metrics, and traces) across all your Kubernetes cluster components (pods, nodes, services, namespaces, and more).</li>
<li>How Elastic’s ML capabilities can reduce your <strong>MTTHH</strong> (mean time to happy hour).</li>
</ul>
<p>Ready to get started? <a href="https://cloud.elastic.co/registration">Register</a> and try out the features and capabilities I’ve outlined above.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-ElasticAgentIntegration-1.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Gain insights into Kubernetes errors with Elastic Observability logs and OpenAI]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kubernetes-errors-observability-logs-openai</link>
            <guid isPermaLink="false">kubernetes-errors-observability-logs-openai</guid>
            <pubDate>Thu, 18 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog post provides an example of how one can analyze error messages in Elasticsearch with ChatGPT using the OpenAI API via Elasticsearch.]]></description>
<content:encoded><![CDATA[<p>As we’ve shown in previous blogs, Elastic<sup>®</sup> provides a way to ingest and manage telemetry from the <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">Kubernetes cluster</a> and the <a href="https://www.elastic.co/blog/opentelemetry-observability">application</a> running on it. Elastic provides out-of-the-box dashboards to help with tracking metrics, <a href="https://www.elastic.co/blog/log-management-observability-operations">log management and analytics</a>, <a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">APM functionality</a> (which also supports <a href="https://www.elastic.co/blog/opentelemetry-observability">native OpenTelemetry</a>), and the ability to analyze everything with <a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops">AIOps features</a> and <a href="https://www.elastic.co/what-is/elasticsearch-machine-learning?elektra=home">machine learning</a> (ML). While you can use pre-existing <a href="https://www.elastic.co/blog/improving-information-retrieval-elastic-stack-search-relevance">ML models in Elastic</a>, <a href="https://www.elastic.co/blog/aiops-automation-analytics-elastic-observability-use-cases">out-of-the-box AIOps features</a>, or your own ML models, there is a need to dig deeper into the root cause of an issue.</p>
<p>Elastic helps reduce the operational work to support more efficient operations, but users still need a way to investigate and understand everything from the cause of an issue to the meaning of specific error messages. As an operations user, if you haven’t run into a particular error before or it's part of some runbook, you will likely go to Google and start searching for information.</p>
<p>OpenAI’s ChatGPT is becoming an interesting generative AI tool that helps provide more information using the models behind it. What if you could use OpenAI to obtain deeper insights (even simple semantics) for an error in your production or development environment? You can easily tie Elastic to OpenAI’s API to achieve this.</p>
<p>Kubernetes, a mainstay in most deployments (on-prem or with a cloud service provider), requires a significant amount of expertise — even if that expertise is to manage a service like GKE, EKS, or AKS.</p>
<p>In this blog, I will cover how you can use <a href="https://www.elastic.co/guide/en/kibana/current/watcher-ui.html">Elastic’s watcher</a> capability to connect Elastic to OpenAI and ask it for more information about the error logs Elastic is ingesting from a Kubernetes cluster(s). More specifically, we will use <a href="https://azure.microsoft.com/en-us/products/cognitive-services/openai-service">Azure’s OpenAI Service</a>. Azure OpenAI is a partnership between Microsoft and OpenAI, so the same models from OpenAI are available in the Microsoft version.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-azure-openai.png" alt="elastic azure openai" /></p>
<p>While this blog goes over a specific example, it can be modified for other types of errors Elastic receives in logs. Whether it's from AWS, the application, databases, etc., the configuration and script described in this blog can be modified easily.</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up the configuration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</li>
<li>We used a GCP GKE Kubernetes cluster, but you can use any Kubernetes cluster service (on-prem or cloud based) of your choice.</li>
<li>We’re also running with a version of the OpenTelemetry Demo. Directions for using Elastic with OpenTelemetry Demo are <a href="https://github.com/elastic/opentelemetry-demo">here</a>.</li>
<li>We also have an Azure account and <a href="https://azure.microsoft.com/en-us/products/cognitive-services/openai-service">Azure OpenAI service configured</a>. You will need to get the appropriate tokens from Azure and the proper URL endpoint from Azure’s OpenAI service.</li>
<li>We will use <a href="https://www.elastic.co/guide/en/kibana/current/devtools-kibana.html">Elastic’s dev tools</a>, the console to be specific, to load up and run the script, which is an <a href="https://www.elastic.co/guide/en/kibana/current/watcher-ui.html">Elastic watcher</a>.</li>
<li>We will also add a new index to store the results from the OpenAI query.</li>
</ul>
<p>Here is the configuration we will set up in this blog:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-configuration.png" alt="Configuration to analyze Kubernetes cluster errors" /></p>
<p>As we walk through the setup, we’ll also provide the alternative setup with OpenAI versus Azure OpenAI Service.</p>
<h2>Setting it all up</h2>
<p>Over the next few steps, I’ll walk through:</p>
<ul>
<li>Getting an account on Elastic Cloud and setting up your K8S cluster and application</li>
<li>Gaining Azure OpenAI authorization (alternative option with OpenAI)</li>
<li>Identifying Kubernetes error logs</li>
<li>Configuring the watcher with the right script</li>
<li>Comparing the output from Azure OpenAI/OpenAI versus ChatGPT UI</li>
</ul>
<h3>Step 0: Create an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-start-cloud-trial.png" alt="elastic start cloud trial" /></p>
<p>Once you have the Elastic Cloud login, set up your Kubernetes cluster and application. A complete step-by-step instructions blog is available <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">here</a>. This also provides an overview of how to see Kubernetes cluster metrics in Elastic and how to monitor them with dashboards.</p>
<h3>Step 1: Azure OpenAI Service and authorization</h3>
<p>When you log in to your Azure subscription and set up an instance of Azure OpenAI Service, you will be able to get your keys under Manage Keys.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-microsoft-azure-manage-keys.png" alt="microsoft azure manage keys" /></p>
<p>There are two keys for your OpenAI instance, but you only need KEY 1.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-pme-openai-keys-and-endpoint.png" alt="Used with permission from Microsoft." /></p>
<p>Additionally, you will need to get the service URL. See the image above with our service URL blanked out to understand where to get the KEY 1 and URL.</p>
<p>If you are using the standard OpenAI service rather than Azure OpenAI Service, you can get your keys at:</p>
<pre><code class="language-bash">https://platform.openai.com/account/api-keys
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-api-keys.png" alt="api keys" /></p>
<p>You will need to create a key and save it. Once you have the key, you can go to Step 2.</p>
<h3>Step 2: Identifying Kubernetes errors in Elastic logs</h3>
<p>As your Kubernetes cluster is running, <a href="https://docs.elastic.co/en/integrations/kubernetes">Elastic’s Kubernetes integration</a> running on the Elastic agent daemon set on your cluster is sending logs and metrics to Elastic. <a href="https://www.elastic.co/blog/log-monitoring-management-enterprise">The telemetry is ingested, processed, and indexed</a>. Kubernetes logs are stored in an index called .ds-logs-kubernetes.container_logs-default-* (* is for the date), and an automatic data stream logs-kubernetes.container_logs is also pre-loaded. So while you can use some of the out-of-the-box dashboards to investigate the metrics, you can also look at all the logs in Elastic Discover.</p>
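<p>If you want to confirm from Dev Tools that container logs are arriving before building anything on top of them, a minimal search against the data stream looks like this (run in the Kibana console; the field name assumes the standard Kubernetes integration mappings):</p>
<pre><code class="language-bash">GET logs-kubernetes.container_logs/_search
{
  &quot;size&quot;: 1,
  &quot;sort&quot;: [ { &quot;@timestamp&quot;: &quot;desc&quot; } ],
  &quot;query&quot;: { &quot;exists&quot;: { &quot;field&quot;: &quot;kubernetes.pod.name&quot; } }
}
</code></pre>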
<p>While any error from Kubernetes can be daunting, the more nuanced issues occur with errors from the pods running in the kube-system namespace. Take the konnectivity-agent pod: it is essentially a network proxy agent running on the node to help establish tunnels, and it is a vital component in Kubernetes. Any error will cause the cluster to have connectivity issues and lead to a cascade of problems, so it’s important to understand and troubleshoot these errors.</p>
<p>When we filter out for error logs from the konnectivity agent, we see a good number of errors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-expanded-document.png" alt="expanded document" /></p>
<p>But unfortunately, we still can’t understand what these errors mean.</p>
<p>Enter OpenAI to help us understand the issue better. Generally, you would take the error message from Discover and paste it with a question in ChatGPT (or run a Google search on the message).</p>
<p>One error in particular that we’ve run into but do not understand is:</p>
<pre><code class="language-bash">E0510 02:51:47.138292       1 client.go:388] could not read stream err=rpc error: code = Unavailable desc = error reading from server: read tcp 10.120.0.8:46156-&gt;35.230.74.219:8132: read: connection timed out serverID=632d489f-9306-4851-b96b-9204b48f5587 agentID=e305f823-5b03-47d3-a898-70031d9f4768
</code></pre>
<p>The OpenAI output is as follows:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-openai-output.png" alt="openai output" /></p>
<p>ChatGPT has given us a fairly nice set of ideas on why this rpc error is occurring against our konnectivity-agent.</p>
<p>So how can we get this output automatically for any error when those errors occur?</p>
<h3>Step 3: Configuring the watcher with the right script</h3>
<p><a href="https://www.elastic.co/guide/en/kibana/current/watcher-ui.html">What is an Elastic watcher?</a> Watcher is an Elasticsearch feature that you can use to create actions based on conditions, which are periodically evaluated using queries on your data. Watchers are helpful for analyzing mission-critical and business-critical streaming data. For example, you might watch application logs for errors causing larger operational issues.</p>
<p>Once a watcher is configured, it can be:</p>
<ol>
<li>Manually triggered</li>
<li>Run periodically</li>
<li>Created using a UI or a script</li>
</ol>
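<p>For the first option, a watcher can be executed on demand from the Dev Tools console, which is handy while iterating on the script. This sketch assumes the watcher ID used later in this blog (chatgpt_analysis); setting ignore_condition forces the actions to run even when no matching error has been found yet:</p>
<pre><code class="language-bash">POST _watcher/watch/chatgpt_analysis/_execute
{
  &quot;ignore_condition&quot;: true
}
</code></pre>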
<p>In this scenario, we will use a script, as we can modify it easily and run it as needed.</p>
<p>We’re using the DevTools Console to enter the script and test it out:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-test-script.png" alt="test script" /></p>
<p>The script is listed at the end of the blog in the <strong>appendix</strong>. It can also be downloaded <a href="https://github.com/elastic/chatgpt-error-analysis"><strong>here</strong></a>.</p>
<p>The script does the following:</p>
<ol>
<li>It runs continuously every five minutes.</li>
<li>It will search the logs for errors from the container konnectivity-agent.</li>
<li>It will take the first error’s message, transform it (re-format and clean up), and place it into a variable first_hit.</li>
</ol>
<pre><code class="language-json">&quot;script&quot;: &quot;return ['first_hit': ctx.payload.first.hits.hits.0._source.message.replace('\&quot;', \&quot;\&quot;)]&quot;
</code></pre>
<ol start="4">
<li>The error message is sent into OpenAI with a query:</li>
</ol>
<pre><code class="language-yaml">What are the potential reasons for the following kubernetes error:
  {{ctx.payload.second.first_hit}}
</code></pre>
<ol start="5">
<li>If the search yielded an error, it will then create an index and place the error message, the pod.name (which is konnectivity-agent-6676d5695b-ccsmx in our setup), and the OpenAI output into a new index called chatgpt_k8s_analyzed.</li>
</ol>
<p>To see the results, we created a new data view called chatgpt_k8s_analyzed against the newly created index:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-edit-data-view.png" alt="edit data view" /></p>
<p>In Discover, the output on the data view provides us with the analysis of the errors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-analysis-of-errors.png" alt="analysis of errors" /></p>
<p>For every error the script sees in the five-minute interval, it will get an analysis of the error. Alternatively, we could use a range to analyze errors during a specific time frame; the script would just need to be modified accordingly.</p>
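<p>As a sketch of that modification, the search body in the appendix script could gain a range clause so the watcher only analyzes errors from, say, the last hour (the surrounding watcher definition stays unchanged; the one-hour window is just an illustrative value):</p>
<pre><code class="language-bash">&quot;body&quot;: {
  &quot;query&quot;: {
    &quot;bool&quot;: {
      &quot;must&quot;: [
        { &quot;match&quot;: { &quot;kubernetes.container.name&quot;: &quot;konnectivity-agent&quot; } },
        { &quot;match&quot;: { &quot;message&quot;: &quot;error&quot; } },
        { &quot;range&quot;: { &quot;@timestamp&quot;: { &quot;gte&quot;: &quot;now-1h&quot; } } }
      ]
    }
  },
  &quot;size&quot;: &quot;1&quot;
}
</code></pre>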
<h3>Step 4. Output from Azure OpenAI/OpenAI vs. ChatGPT UI</h3>
<p>As you noticed above, we got roughly the same result from the Azure OpenAI API call as we did by testing our query in the ChatGPT UI. This is because we configured the API call to use the same model as what was selected in the UI.</p>
<p>For the API call, we used the following parameters:</p>
<pre><code class="language-json">&quot;request&quot;: {
             &quot;method&quot; : &quot;POST&quot;,
             &quot;url&quot;: &quot;https://XXX.openai.azure.com/openai/deployments/pme-gpt-35-turbo/chat/completions?api-version=2023-03-15-preview&quot;,
             &quot;headers&quot;: {&quot;api-key&quot; : &quot;XXXXXXX&quot;,
                         &quot;content-type&quot; : &quot;application/json&quot;
                        },
             &quot;body&quot; : &quot;{ \&quot;messages\&quot;: [ { \&quot;role\&quot;: \&quot;system\&quot;, \&quot;content\&quot;: \&quot;You are a helpful assistant.\&quot;}, { \&quot;role\&quot;: \&quot;user\&quot;, \&quot;content\&quot;: \&quot;What are the potential reasons for the following kubernetes error: {{ctx.payload.second.first_hit}}\&quot;}], \&quot;temperature\&quot;: 0.5, \&quot;max_tokens\&quot;: 2048}&quot; ,
              &quot;connection_timeout&quot;: &quot;60s&quot;,
               &quot;read_timeout&quot;: &quot;60s&quot;
                            }
</code></pre>
<p>By setting the role: system message to You are a helpful assistant and using the gpt-35-turbo portion of the URL, we are setting the API to use the gpt-3.5-turbo model, which is the same model the ChatGPT UI uses by default.</p>
<p>Additionally, for Azure OpenAI Service, you will need to set the URL to something similar to the following:</p>
<pre><code class="language-bash">https://YOURSERVICENAME.openai.azure.com/openai/deployments/pme-gpt-35-turbo/chat/completions?api-version=2023-03-15-preview
</code></pre>
<p>If you use OpenAI (versus Azure OpenAI Service), the request call (against <a href="https://api.openai.com/v1/completions">https://api.openai.com/v1/completions</a>) would be as such:</p>
<pre><code class="language-json">&quot;request&quot;: {
            &quot;scheme&quot;: &quot;https&quot;,
            &quot;host&quot;: &quot;api.openai.com&quot;,
            &quot;port&quot;: 443,
            &quot;method&quot;: &quot;post&quot;,
            &quot;path&quot;: &quot;\/v1\/completions&quot;,
            &quot;params&quot;: {},
            &quot;headers&quot;: {
               &quot;content-type&quot;: &quot;application\/json&quot;,
               &quot;authorization&quot;: &quot;Bearer YOUR_ACCESS_TOKEN&quot;
                        },
            &quot;body&quot;: &quot;{ \&quot;model\&quot;: \&quot;text-davinci-003\&quot;,  \&quot;prompt\&quot;: \&quot;What are the potential reasons for the following kubernetes error: {{ctx.payload.second.first_hit}}\&quot;,  \&quot;temperature\&quot;: 1,  \&quot;max_tokens\&quot;: 512,     \&quot;top_p\&quot;: 1.0,      \&quot;frequency_penalty\&quot;: 0.0,   \&quot;presence_penalty\&quot;: 0.0 }&quot;,
            &quot;connection_timeout_in_millis&quot;: 60000,
            &quot;read_timeout_millis&quot;: 60000
          }
</code></pre>
<p>If you are interested in creating a more OpenAI-based version, you can <a href="https://elastic-content-share.eu/downloads/watcher-job-to-integrate-chatgpt-in-elasticsearch/">download an alternative script</a> and look at <a href="https://mar1.hashnode.dev/unlocking-the-power-of-aiops-with-chatgpt-and-elasticsearch">another blog from an Elastic community member</a>.</p>
<h2>Gaining other insights beyond Kubernetes logs</h2>
<p>Now that the script is up and running, you can modify it using different:</p>
<ul>
<li>Inputs</li>
<li>Conditions</li>
<li>Actions</li>
<li>Transforms</li>
</ul>
<p>Learn more on how to modify it <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/xpack-alerting.html">here</a>. Some examples of modifications could include:</p>
<ol>
<li>Look for error logs from application components (e.g., cartService, frontEnd, from the OTel demo), cloud service providers (e.g., AWS/Azure/GCP logs), and even logs from components such as Kafka, databases, etc.</li>
<li>Vary the time frame from running continuously to running over a specific <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-range-query.html">range</a>.</li>
<li>Look for specific errors in the logs.</li>
<li>Query for analysis on a set of errors at once versus just one, which we demonstrated.</li>
</ol>
<p>The modifications are endless, and of course you can run this with OpenAI rather than Azure OpenAI Service.</p>
<h2>Conclusion</h2>
<p>I hope you’ve gotten an appreciation for how Elastic Observability can help you connect to OpenAI services (Azure OpenAI, as we showed, or even OpenAI) to better analyze an error log message instead of having to run several Google searches and hunt for possible insights.</p>
<p>Here’s a quick recap of what we covered:</p>
<ul>
<li>Developing an Elastic watcher script that can be used to find and send Kubernetes errors into OpenAI and insert them into a new index</li>
<li>Configuring Azure OpenAI Service or OpenAI with the right authorization and request parameters</li>
</ul>
<p>Ready to get started? Sign up <a href="https://cloud.elastic.co/registration">for Elastic Cloud</a> and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your OpenTelemetry data.</p>
<h2>Appendix</h2>
<p>Watcher script</p>
<pre><code class="language-bash">PUT _watcher/watch/chatgpt_analysis
{
    &quot;trigger&quot;: {
      &quot;schedule&quot;: {
        &quot;interval&quot;: &quot;5m&quot;
      }
    },
    &quot;input&quot;: {
      &quot;chain&quot;: {
          &quot;inputs&quot;: [
              {
                  &quot;first&quot;: {
                      &quot;search&quot;: {
                          &quot;request&quot;: {
                              &quot;search_type&quot;: &quot;query_then_fetch&quot;,
                              &quot;indices&quot;: [
                                &quot;logs-kubernetes*&quot;
                              ],
                              &quot;rest_total_hits_as_int&quot;: true,
                              &quot;body&quot;: {
                                &quot;query&quot;: {
                                  &quot;bool&quot;: {
                                    &quot;must&quot;: [
                                      {
                                        &quot;match&quot;: {
                                          &quot;kubernetes.container.name&quot;: &quot;konnectivity-agent&quot;
                                        }
                                      },
                                      {
                                        &quot;match&quot; : {
                                          &quot;message&quot;:&quot;error&quot;
                                        }
                                      }
                                    ]
                                  }
                                },
                                &quot;size&quot;: &quot;1&quot;
                              }
                            }
                        }
                    }
                },
                {
                    &quot;second&quot;: {
                        &quot;transform&quot;: {
                            &quot;script&quot;: &quot;return ['first_hit': ctx.payload.first.hits.hits.0._source.message.replace('\&quot;', \&quot;\&quot;)]&quot;
                        }
                    }
                },
                {
                    &quot;third&quot;: {
                        &quot;http&quot;: {
                            &quot;request&quot;: {
                                &quot;method&quot; : &quot;POST&quot;,
                                &quot;url&quot;: &quot;https://XXX.openai.azure.com/openai/deployments/pme-gpt-35-turbo/chat/completions?api-version=2023-03-15-preview&quot;,
                                &quot;headers&quot;: {
                                    &quot;api-key&quot; : &quot;XXX&quot;,
                                    &quot;content-type&quot; : &quot;application/json&quot;
                                },
                                &quot;body&quot; : &quot;{ \&quot;messages\&quot;: [ { \&quot;role\&quot;: \&quot;system\&quot;, \&quot;content\&quot;: \&quot;You are a helpful assistant.\&quot;}, { \&quot;role\&quot;: \&quot;user\&quot;, \&quot;content\&quot;: \&quot;What are the potential reasons for the following kubernetes error: {{ctx.payload.second.first_hit}}\&quot;}], \&quot;temperature\&quot;: 0.5, \&quot;max_tokens\&quot;: 2048}&quot; ,
                                &quot;connection_timeout&quot;: &quot;60s&quot;,
                                &quot;read_timeout&quot;: &quot;60s&quot;
                            }
                        }
                    }
                }
            ]
        }
    },
    &quot;condition&quot;: {
      &quot;compare&quot;: {
        &quot;ctx.payload.first.hits.total&quot;: {
          &quot;gt&quot;: 0
        }
      }
    },
    &quot;actions&quot;: {
        &quot;index_payload&quot; : {
            &quot;transform&quot;: {
                &quot;script&quot;: {
                    &quot;source&quot;: &quot;&quot;&quot;
                        def payload = [:];
                        payload.timestamp = new Date();
                        payload.pod_name = ctx.payload.first.hits.hits[0]._source.kubernetes.pod.name;
                        payload.error_message = ctx.payload.second.first_hit;
                        payload.chatgpt_analysis = ctx.payload.third.choices[0].message.content;
                        return payload;
                    &quot;&quot;&quot;
                }
            },
            &quot;index&quot; : {
                &quot;index&quot; : &quot;chatgpt_k8s_analyzed&quot;
            }
        }
    }
}
</code></pre>
<h3>Additional logging resources:</h3>
<ul>
<li><a href="https://www.elastic.co/getting-started/observability/collect-and-analyze-logs">Getting started with logging on Elastic (quickstart)</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/logs-metrics-get-started.html">Ingesting common known logs via integrations (compute node example)</a></li>
<li><a href="https://docs.elastic.co/integrations">List of integrations</a></li>
<li><a href="https://www.elastic.co/blog/log-monitoring-management-enterprise">Ingesting custom application logs into Elastic</a></li>
<li><a href="https://www.elastic.co/blog/observability-logs-parsing-schema-read-write">Enriching logs in Elastic</a></li>
<li>Analyzing Logs with <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">Anomaly Detection (ML)</a> and <a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops">AIOps</a></li>
</ul>
<h3>Common use case examples with logs:</h3>
<ul>
<li><a href="https://youtu.be/ax04ZFWqVCg">Nginx log management</a></li>
<li><a href="https://www.elastic.co/blog/vpc-flow-logs-monitoring-analytics-observability">AWS VPC Flow log management</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-errors-observability-logs-openai">Using OpenAI to analyze Kubernetes errors</a></li>
<li><a href="https://youtu.be/Li5TJAWbz8Q">PostgreSQL issue analysis with AIOps</a></li>
</ul>
<p><em>In this blog post, we may have used third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
<p><em>Screenshots of Microsoft products used with permission from Microsoft.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-configuration.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Process Kubernetes logs with ease using Elastic Streams]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kubernetes-logs-elastic-streams-processing</link>
            <guid isPermaLink="false">kubernetes-logs-elastic-streams-processing</guid>
            <pubDate>Thu, 12 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to process Kubernetes logs with Elastic Streams using conditional blocks, AI-generated Grok patterns, and selective drops to reduce noise and storage cost.]]></description>
<content:encoded><![CDATA[<p>Streams is a new AI capability within Elastic Observability. Built on the Elasticsearch platform, it's designed for Site Reliability Engineers (SREs) to use logs as the primary signal for investigations, enabling faster answers and quicker issue resolution. For decades, logs have been considered too noisy, expensive, and complex to manage, and many observability vendors have treated them as a second-class citizen. Streams flips this script by transforming raw logs into your most valuable asset, surfacing not only the root cause but also the why behind it to enable instant resolution.</p>
<p>Learn more in our previous article, <a href="https://www.elastic.co/observability-labs/blog/elastic-observability-streams-ai-logs-investigations">Introducing Streams</a>.</p>
<p>Many SREs deploy on cloud-native architectures, with Kubernetes as the de facto baseline deployment platform. Yet Kubernetes logs are messy by default: a single data stream often mixes access logs, JSON payloads, health checks, and internal service chatter.</p>
<p>Elastic Streams gives you a faster path. You can isolate subsets of logs with conditionals, use AI to generate Grok patterns from real samples, and drop documents you do not need before they add storage and query cost.</p>
<h2>Why Kubernetes logs get messy fast</h2>
<p>The default Kubernetes container logs stream can contain data from many services at once. In one sample, you might see:</p>
<ul>
<li>HTTP access logs from application pods</li>
<li>Verbose worker or batch job status logs</li>
<li>Platform and container lifecycle events with different formats</li>
</ul>
<p>This is why &quot;one global parsing rule&quot; will fail: you need targeted processing logic per log shape or application type. Historically, this kind of custom processing has been error-prone and time-consuming.</p>
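<p>To make the problem concrete, here is a minimal sketch (hypothetical routing logic, not part of Streams) of what per-log-shape processing means: each line in a mixed container log stream needs a different parser.</p>

```python
import json
import re

# Hypothetical access-log shape; real patterns depend on your workloads.
ACCESS_RE = re.compile(r'^(?P<ip>\d+\.\d+\.\d+\.\d+) .* "(?P<method>[A-Z]+) (?P<path>\S+)')

def route(line: str) -> dict:
    """Classify a raw container log line and parse it accordingly."""
    line = line.strip()
    if line.startswith("{"):            # JSON payloads from application pods
        return {"type": "json", "doc": json.loads(line)}
    m = ACCESS_RE.match(line)
    if m:                               # HTTP access logs
        return {"type": "access", "doc": m.groupdict()}
    return {"type": "raw", "doc": {"message": line}}   # everything else
```

<p>Streams lets you express this kind of routing declaratively with conditional blocks instead of maintaining custom code like this yourself.</p>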
<h2>What Streams Processing changes</h2>
<p>Streams Processing (available in 9.2 and later) moves this workflow into a live, interactive experience:</p>
<ul>
<li>You build conditions and processors in the UI</li>
<li>You validate each change against sample documents before saving</li>
<li>You can use AI to generate extraction patterns from selected logs</li>
</ul>
<p>The result is a safer way to iterate on parsing logic without guessing.</p>
<h2>Walkthrough: parse custom application logs</h2>
<p>We'll start from your Kubernetes stream (<code>logs-kubernetes.containers_logs-default</code>) and create a conditional block that scopes processing to one service.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/01-conditional-filter-litellm.png" alt="Conditional block filtering Kubernetes logs for litellm before parsing in Elastic Streams" /></p>
<p>Once the condition is saved, the preview automatically filters the sample data to the subset of logs that match the condition, indicated by the blue highlight.</p>
<p>Inside that block, we'll add a Grok processor and click <strong>Generate pattern</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/02-generate-pattern-button.png" alt="Generate pattern button in Elastic Streams using AI to process Kubernetes logs" /></p>
<p>This agentic process uses an LLM to generate a Grok pattern for parsing the logs. By default it uses the Elastic Inference Service, but you can configure it to use your own LLM.
Review the generated pattern and accept it once it validates against the sample set.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/03-accept-generated-grok.png" alt="Accepting AI-generated Grok pattern after matching selected Uvicorn logs" /></p>
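<p>For illustration, a generated pattern for Uvicorn-style access logs might look like the following, shown here as an equivalent Python regex (the actual Grok pattern Streams generates will differ based on your samples):</p>

```python
import re

# Hypothetical equivalent of an AI-generated Grok pattern for Uvicorn access
# logs, written as a Python regex; the real pattern varies with your sample set.
UVICORN = re.compile(
    r'^(?P<level>[A-Z]+):\s+'
    r'(?P<client_ip>[\d.]+):(?P<client_port>\d+) - '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3})'
)

sample = 'INFO:     10.89.0.5:43210 - "POST /chat/completions HTTP/1.1" 200 OK'
fields = UVICORN.match(sample).groupdict()   # extracts method, path, status, etc.
```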
<h2>Walkthrough: drop noisy postgres-loadgen documents</h2>
<p>Not all logs are important enough to keep around forever. For example, logs from a load-generation tool used for load testing are rarely useful for long-term analysis, so let's drop those.</p>
<p>To do this, we'll add a second conditional block for logs you intentionally do not want to index long-term.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/05-preview-selected-postgres-loadgen.png" alt="Selected tab preview of noisy postgres-loadgen documents before drop" /></p>
<p>Add a drop processor inside this block, then validate in the <strong>Dropped</strong> tab.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/07-preview-dropped-tab.png" alt="Dropped tab preview showing noisy Kubernetes logs excluded from indexing" /></p>
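<p>Conceptually, the drop block behaves like the following sketch (a hypothetical simulation of the condition plus drop processor; the field path assumes standard Kubernetes log metadata):</p>

```python
# Hypothetical drop condition mirroring the conditional block in the UI:
# discard anything emitted by the postgres-loadgen container before indexing.
def should_drop(doc: dict) -> bool:
    return doc.get("kubernetes", {}).get("container", {}).get("name") == "postgres-loadgen"

docs = [
    {"kubernetes": {"container": {"name": "postgres-loadgen"}}, "message": "tick"},
    {"kubernetes": {"container": {"name": "litellm"}}, "message": "request ok"},
]
kept = [d for d in docs if not should_drop(d)]   # only non-loadgen docs survive
```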
<h2>Save safely with live simulation</h2>
<p>One of the most useful parts of Streams is the preview-first workflow. You can inspect matched, parsed, skipped, failed, and dropped samples before making the change live.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/08-save-changes.png" alt="Save changes button after validating processing logic on live samples" /></p>
<h2>YAML mode and the equivalent API request</h2>
<p>The interactive builder works well for most edits, but advanced users can switch to YAML mode for direct control.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/11-yaml-mode.png" alt="Switching from interactive builder to YAML mode in Streams processing" /></p>
<p>You can also open <strong>Equivalent API Request</strong> to copy the payload for automation and Infrastructure as Code workflows.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/12-equivalent-api-request.png" alt="Equivalent API request panel for automating Streams processing" /></p>
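<p>As a sketch of how a copied payload could be replayed from an automation script (the URL path, payload shape, and API key below are placeholders, not the exact Streams API — copy the real values from the Equivalent API Request panel and your deployment):</p>

```python
import json
import urllib.request

def build_streams_request(base_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Assemble (but do not send) a request replaying a copied Streams payload."""
    return urllib.request.Request(
        # Placeholder path: take the real method and path from the panel.
        url=f"{base_url}/api/streams/logs-kubernetes.containers_logs-default/_ingest",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"Authorization": f"ApiKey {api_key}", "Content-Type": "application/json"},
    )

req = build_streams_request("https://my-deployment.kb.example.cloud", "my-api-key", {"steps": []})
```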
<h2>A note on backwards compatibility</h2>
<p>Streams Processing builds on Elasticsearch ingest pipelines, so it works with the same ingestion model teams already use.</p>
<p>When you save processing changes, Streams appends logic through the stream processing pipeline model (for example via <code>@custom</code> conventions used by data streams). That means you can adopt conditionals, parsing, and selective dropping incrementally, without changing your Kubernetes log shippers.</p>
<h2>What's next?</h2>
<p>Streams Processing is continually gaining new processing capabilities. Check out the <a href="https://www.elastic.co/docs/solutions/observability/streams/streams">Streams documentation</a> for the latest updates.</p>
<p>Over the coming months more of this will be automated and moved to the background, reducing the manual effort required to process logs.</p>
<p>Another milestone we're working towards is offering this processing at read time rather than write time. Powered by ES|QL, this will let you iterate on your parsing logic without worrying about committing changes that are harder to revert.</p>
<p>Also try this out by getting a free trial on <a href="https://cloud.elastic.co/">Elastic Serverless</a>.</p>
<p>Happy log analytics!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/cover.svg" length="0" type="image/svg+xml"/>
        </item>
        <item>
            <title><![CDATA[Native OpenTelemetry support in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/native-opentelemetry-support-in-elastic-observability</link>
            <guid isPermaLink="false">native-opentelemetry-support-in-elastic-observability</guid>
            <pubDate>Wed, 13 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic offers native support for OpenTelemetry by allowing for direct ingest of OpenTelemetry traces, metrics, and logs without conversion, and applying any Elastic feature against OTel data without degradation in capabilities.]]></description>
            <content:encoded><![CDATA[<p>NOTE: Since writing this blog, new OTel data ingest configurations are now available in Elastic. See recent <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator">blog</a></p>
<p>OpenTelemetry is more than just becoming the open ingestion standard for observability. As one of the major Cloud Native Computing Foundation (CNCF) projects, with as many commits as Kubernetes, it is gaining support from major ISVs and cloud providers delivering support for the framework. Many global companies from finance, insurance, tech, and other industries are starting to standardize on OpenTelemetry. With OpenTelemetry, DevOps teams have a consistent approach to collecting and ingesting telemetry data providing a de-facto standard for observability.</p>
<p>Elastic<sup>®</sup> is strategically standardizing on OpenTelemetry as the main data collection architecture for observability and security. Additionally, Elastic is making a commitment to help OpenTelemetry become the de facto data collection infrastructure for the observability ecosystem. Elastic is deepening its relationship with OpenTelemetry beyond the recent contribution of Elastic Common Schema (ECS) to OpenTelemetry (OTel).</p>
<p>Elastic has supported OpenTelemetry natively since Elastic 7.14, directly ingesting OpenTelemetry protocol (OTLP) based traces, metrics, and logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-1-otel-config-options.png" alt="otel configuration options" /></p>
<p>In this blog, we’ll review the current OpenTelemetry support provided by Elastic, which includes the following:</p>
<ul>
<li><a href="#ingesting-opentelemetry-into-elastic"><strong>Easy ingest of distributed tracing and metrics</strong></a> for applications configured with OpenTelemetry agents for Python, NodeJS, Java, Go, and .NET</li>
<li><a href="#opentelemetry-logs-in-elastic"><strong>OpenTelemetry logs instrumentation and ingest</strong></a> using various configurations</li>
<li><a href="#opentelemetry-is-elastics-preferred-schema"><strong>Open semantic conventions</strong></a> for logs and more through ECS, which is not part of OpenTelemetry</li>
<li><a href="#elastic-observability-apm-and-machine-learning-capabilities"><strong>Machine learning based AIOps capabilities</strong></a>, such as latency correlations, failure correlations, anomaly detection, log spike analysis, predictive pattern analysis, Elastic AI Assistant support, and more, all apply to native OTLP telemetry.</li>
<li><a href="#elastic-allows-you-to-migrate-to-otel-on-your-schedule"><strong>Migrate applications to OpenTelemetry at your own speed</strong></a>. Elastic’s APM capabilities all work seamlessly even with a mix of services using OpenTelemetry and/or Elastic APM agents. You can even combine OpenTelemetry instrumentation with Elastic Agent.</li>
<li><a href="#integrated-kubernetes-and-opentelemetry-views-in-elastic"><strong>Integrated views and analysis with Kubernetes clusters</strong></a>, which most OpenTelemetry applications are running on. Elastic can highlight specific pods and containers related to each service when analyzing issues for applications based on OpenTelemetry.</li>
</ul>
<h2>Ingesting OpenTelemetry into Elastic</h2>
<p>If you’re interested in seeing how simple it is to ingest OpenTelemetry traces and metrics into Elastic, follow the steps outlined in this blog.</p>
<p>Let’s outline what Elastic provides for ingesting OpenTelemetry data. Here are all your options:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-2-flowchart.png" alt="flowchart" /></p>
<h3>Using the OpenTelemetry Collector</h3>
<p>When using the OpenTelemetry Collector, which is the most common configuration option, you simply have to add two key variables.</p>
<p>The instructions utilize a specific OpenTelemetry Collector configuration for Elastic. Essentially, the Elastic <a href="https://github.com/elastic/opentelemetry-demo/blob/main/kubernetes/elastic-helm/values.yaml">values.yaml</a> file specified in elastic/opentelemetry-demo configures the OpenTelemetry Collector to point to the Elastic APM Server using two main values:</p>
<ul>
<li><code>OTEL_EXPORTER_OTLP_ENDPOINT</code>: the URL of Elastic's APM Server</li>
<li><code>OTEL_EXPORTER_OTLP_HEADERS</code>: the Elastic authorization header</li>
</ul>
<p>These two values can be found in the OpenTelemetry setup instructions under the APM integration instructions (Integrations-&gt;APM) in your Elastic Cloud.</p>
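<p>For example, these values can be set programmatically before initializing an OpenTelemetry SDK. The endpoint and token below are placeholders — substitute the values from your own deployment:</p>

```python
import os

# Placeholder values: copy the real endpoint and secret token from the
# APM integration page (Integrations -> APM) in your Elastic Cloud deployment.
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://my-deployment.apm.us-east-1.aws.cloud.es.io:443"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = "Authorization=Bearer my-secret-token"

# Any OpenTelemetry SDK initialized after this point picks these values up
# automatically, since they are standard OTel exporter environment variables.
```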
<h3>Native OpenTelemetry agents embedded in code</h3>
<p>If you are thinking of using OpenTelemetry libraries in your code, you can simply point the service to Elastic's APM Server, because it natively supports the OTLP protocol. No special Elastic conversion is needed.</p>
<p>To demonstrate this effectively and provide some education on how to use OpenTelemetry, we have two applications you can use to learn from:</p>
<ul>
<li><a href="https://github.com/elastic/opentelemetry-demo">Elastic’s version of OpenTelemetry demo</a>: As with all the other observability vendors, we have our own forked version of the OpenTelemetry demo.</li>
<li><a href="https://github.com/elastic/workshops-instruqt/tree/main/Elastiflix">Elastiflix:</a> This demo application is an example to help you learn how to instrument on various languages and telemetry signals.</li>
</ul>
<p>Check out our blogs on using the Elastiflix application and instrumenting with OpenTelemetry:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
</ul>
<p>We have created YouTube videos on these topics as well:</p>
<ul>
<li><a href="https://youtu.be/wMXMRsjFg-8?feature=shared">How to Manually Instrument Java with OpenTelemetry (Part 1)</a></li>
<li><a href="https://youtu.be/PX7s6RRLGaU?feature=shared">How to Manually Instrument Java with OpenTelemetry (Part 2)</a></li>
<li><a href="https://youtu.be/hXTlV_RnELc?feature=shared">Custom Java Instrumentation with OpenTelemetry</a></li>
<li><a href="https://youtu.be/E8g9u_uOFO4?feature=shared">Elastic APM - Automatic .NET Instrumentation with OpenTelemetry</a></li>
<li><a href="https://youtu.be/7J9M2JsHwRE?feature=shared">How to Manually Instrument .NET Applications with OpenTelemetry</a></li>
</ul>
<p>Given Elastic and OpenTelemetry’s vast user base, these provide a rich source of education for anyone trying to learn the intricacies of instrumenting with OpenTelemetry.</p>
<h3>Elastic Agents supporting OpenTelemetry</h3>
<p>If you’ve already deployed Elastic APM agents, you can still use them with OpenTelemetry. <a href="https://www.elastic.co/blog/opentelemetry-instrumentation-elastic-apm-agent-features">Elastic APM agents today are able to ship OpenTelemetry</a> spans as part of a trace. This means that if you have any component in your application that emits an OpenTelemetry span, it’ll be part of the trace the Elastic APM agent captures.</p>
<h2>OpenTelemetry logs in Elastic</h2>
<p>If you look at the OpenTelemetry documentation, you will see that many language logging libraries are still experimental or not yet implemented; Java is stable, per the documentation. Depending on your service’s language, and your appetite for adventure, there are several options for exporting logs from your services and applications and marrying them together in your observability backend.</p>
<p>In a previous blog, we discussed <a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 different configurations to properly get logging data into Elastic for Java</a>. The blog explores the current state of the art of OpenTelemetry logging and provides guidance on the available approaches with the following tenets in mind:</p>
<ul>
<li>Correlation of service logs with OTel-generated tracing where applicable</li>
<li>Proper capture of exceptions</li>
<li>Common context across tracing, metrics, and logging</li>
<li>Support for slf4j key-value pairs (“structured logging”)</li>
<li>Automatic attachment of metadata carried between services via OTel baggage</li>
<li>Use of an Elastic Observability backend</li>
<li>Consistent data fidelity in Elastic regardless of the approach taken</li>
</ul>
<p>Three models, which are covered in the blog, currently exist for getting your application or service logs to Elastic with correlation to OTel tracing and baggage:</p>
<ul>
<li>Output logs from your service (alongside traces and metrics) using an embedded OpenTelemetry Instrumentation library to Elastic via the OTLP protocol</li>
<li>Write logs from your service to a file scraped by the OpenTelemetry Collector, which then forwards to Elastic via the OTLP protocol</li>
<li>Write logs from your service to a file scraped by Elastic Agent (or Filebeat), which then forwards to Elastic via an Elastic-defined protocol</li>
</ul>
<p>Note that (1), in contrast to (2) and (3), does not involve writing service logs to a file prior to ingestion into Elastic.</p>
<h2>OpenTelemetry is Elastic’s preferred schema</h2>
<p>Elastic recently contributed the <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">Elastic Common Schema (ECS) to the OpenTelemetry (OTel)</a> project, enabling a unified data specification for security and observability data within the OTel Semantic Conventions framework.</p>
<p>ECS, an open source specification, was developed with support from the Elastic user community to define a common set of fields to be used when storing event data in Elasticsearch<sup>®</sup>. ECS helps reduce management and storage costs stemming from data duplication, improving operational efficiency.</p>
<p>Similarly, OTel’s Semantic Conventions (SemConv) also specify common names for various kinds of operations and data. The benefit of using OTel SemConv is in following a common naming scheme that can be standardized across a codebase, libraries, and platforms for OTel users.</p>
<p>The merging of ECS and OTel SemConv will help advance OTel’s adoption and the continued evolution and convergence of observability and security domains.</p>
<h2>Elastic Observability APM and machine learning capabilities</h2>
<p>All of Elastic Observability’s APM capabilities are available with OTel data (read more on this in our blog, <a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry</a>):</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services</li>
<li>Transactions (traces)</li>
<li>ML correlations (specifically for latency)</li>
<li>Service logs</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-3-services.png" alt="services" /></p>
<p>In addition to Elastic’s APM and unified view of the telemetry data, you can now use Elastic’s powerful machine learning capabilities to speed up analysis and alerting, helping reduce MTTR. Here are some of the ML-based AIOps capabilities we offer:</p>
<ul>
<li><a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability"><strong>Anomaly detection:</strong></a> Elastic Observability, when turned on (<a href="https://www.elastic.co/guide/en/kibana/current/xpack-ml-anomalies.html">see documentation</a>), automatically detects anomalies by continuously modeling the normal behavior of your OpenTelemetry data — learning trends, periodicity, and more.</li>
<li><a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability"><strong>Log categorization:</strong></a> Elastic also identifies patterns in your OpenTelemetry log events quickly, so that you can take action quicker.</li>
<li><strong>High-latency or erroneous transactions:</strong> Elastic Observability’s APM capability helps you discover which attributes are contributing to increased transaction latency and identifies which attributes are most influential in distinguishing between transaction failures and successes.</li>
<li><a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops"><strong>Log spike detector</strong></a> helps identify reasons for increases in OpenTelemetry log rates. It makes it easy to find and investigate causes of unusual spikes by using the analysis workflow view.</li>
<li><a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops"><strong>Log pattern analysis</strong></a> helps you find patterns in unstructured log messages and makes it easier to examine your data.</li>
</ul>
<h2>Elastic allows you to migrate to OTel on your schedule</h2>
<p>Although OpenTelemetry supports many programming languages, the <a href="https://opentelemetry.io/docs/instrumentation/">maturity of its major functional components</a> — metrics, traces, and logs — still varies by language. Thus applications written in Java, Python, and JavaScript are good candidates to start with, as their metrics, traces, and (for Java) logs are stable.</p>
<p>For the languages that are not yet supported, you can instrument those services using Elastic agents, running your <a href="https://www.elastic.co/observability">full stack observability platform</a> in mixed mode (Elastic agents alongside OpenTelemetry agents).</p>
<p>Here is a simple example:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-4-services2.png" alt="services 2" /></p>
<p>The above shows a simple variation of our standard Elastic Agent application with one service — the newsletter-otel service — flipped to OTel. You can convert each of the remaining services as development resources allow.</p>
<p>Hence you can migrate to OpenTelemetry on your own schedule, moving individual services as the relevant languages reach a stable state.</p>
<h2>Integrated Kubernetes and OpenTelemetry views in Elastic</h2>
<p>Elastic monitors your Kubernetes cluster using the Elastic Agent, which you can deploy on the same cluster where your OpenTelemetry application is running. Hence you can not only use OpenTelemetry for your application, but Elastic can also monitor the corresponding Kubernetes cluster.</p>
<p>There are two configurations for Kubernetes:</p>
<p><strong>1. Simply deploying the Elastic Agent DaemonSet on the Kubernetes cluster.</strong> We outline this in the article entitled <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">Managing your Kubernetes cluster with Elastic Observability</a>. This pushes just the Kubernetes metrics and logs to Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-5-cloud-nodes.png" alt="elastic cloud nodes" /></p>
<p><strong>2. Deploying the Elastic Agent with not only the Kubernetes DaemonSet, but also Elastic’s APM integration, the Defend (security) integration, and the Network Packet Capture integration</strong> to provide more comprehensive Kubernetes cluster observability. We outline this configuration in the article <a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-6-flowhcart.png" alt="flowchart" /></p>
<p>Both <a href="https://www.elastic.co/observability/opentelemetry">OpenTelemetry visualization</a> examples use the OpenTelemetry demo, and in Elastic, we tie the Kubernetes information with the application to provide you an ability to see Kubernetes information from your traces in APM. This provides a more integrated approach when troubleshooting.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-7-pod-deets.png" alt="pod details" /></p>
<h2>Summary</h2>
<p>In essence, Elastic's commitment goes beyond mere support for OpenTelemetry. We are dedicated to ensuring our customers not only adopt OpenTelemetry but thrive with it. Through our solutions, expertise, and resources, we aim to elevate the observability journey for every business, turning data into actionable insights that drive growth and innovation.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/ecs-otel-announcement-2.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Tracing, logs, and metrics for a RAG based Chatbot with Elastic Distributions of OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/openai-tracing-elastic-opentelemetry</link>
            <guid isPermaLink="false">openai-tracing-elastic-opentelemetry</guid>
            <pubDate>Fri, 24 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[How to observe a OpenAI RAG based application using Elastic. Instrument the app, collect logs, traces, metrics, and understand how well the LLM is performing with Elastic Distributions of OpenTelemetry on Kubernetes and Docker.]]></description>
<content:encoded><![CDATA[<p>As discussed in the following post, <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-openai">Elastic added instrumentation for OpenAI based applications in EDOT</a>. The application most commonly built on LLMs is the chatbot. These chatbots not only use large language models (LLMs), but also frameworks such as LangChain, along with search-based retrieval augmented generation (RAG) to improve contextual information during a conversation. Elastic's sample <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app">RAG based Chatbot application</a> showcases how to use Elasticsearch with local data that has embeddings, enabling search to pull out the most contextual information during a query with a chatbot connected to an LLM of your choice. It's a great example of how to build out a RAG based application with Elasticsearch.</p>
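<p>At its core, the RAG flow such an app implements is: retrieve the most relevant indexed chunks, then assemble them into the LLM prompt. A minimal sketch, with illustrative function names and a toy keyword match standing in for the app's Elasticsearch embedding search:</p>

```python
def retrieve(query: str, corpus: dict) -> list:
    """Toy stand-in for an Elasticsearch embedding/keyword search over indexed docs."""
    return [text for text in corpus.values() if query.lower() in text.lower()]

def build_prompt(query: str, contexts: list) -> str:
    """Assemble retrieved context and the user question into an LLM prompt."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"

corpus = {"doc1": "Vacation policy: 25 days per year.", "doc2": "Office hours: 9 to 5."}
prompt = build_prompt("vacation", retrieve("vacation", corpus))
```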
<p>This app is now also instrumented with EDOT, and you can visualize the Chatbot's traces to OpenAI, as well as relevant logs and metrics from the application. By running the app as instructed in the GitHub repo with Docker, you can see these traces on a local stack. But what about running it against Serverless, Elastic Cloud, or even Kubernetes?</p>
<p>In this blog we will walk through how to set up Elastic's RAG based Chatbot application with Elastic Cloud and Kubernetes.</p>
<h1>Prerequisites:</h1>
<p>To follow along, you will need the following prerequisites:</p>
<ul>
<li>
<p>An Elastic Cloud account (sign up now), and familiarity with Elastic's OpenTelemetry configuration. No specific version is required for Serverless; for regular Elastic Cloud, use 8.17 or later.</p>
</li>
<li>
<p>Git clone the <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app">RAG based Chatbot application</a> and go through the <a href="https://www.elastic.co/search-labs/tutorials/chatbot-tutorial/welcome">tutorial</a> to become familiar with the application and how to bring it up using Docker.</p>
</li>
<li>
<p>An account on OpenAI with API keys</p>
</li>
<li>
<p>Kubernetes cluster to run the RAG based Chatbot app</p>
</li>
<li>
<p>The instructions in this blog are also found in <a href="https://github.com/elastic/observability-examples/tree/main/chatbot-rag-app-observability">observability-examples</a> on GitHub.</p>
</li>
</ul>
<h1>Application OpenTelemetry output in Elastic</h1>
<h2>Chatbot-rag-app</h2>
<p>The first item that you will need to get up and running is the ChatBotApp, and once up you should see the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/Chatbotapp-general.png" alt="Chatbot app main page" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/Chatbotapp-details.png" alt="Chatbot app working" /></p>
<p>As you select some of the questions, you will get a response based on the index that was created in Elasticsearch when the app initialized. Additionally, there will be queries made to LLMs.</p>
<h2>Traces, logs, and metrics from EDOT in Elastic</h2>
<p>Once you have the application running on your K8s cluster or with Docker, and Elastic Cloud up and running you should see the following:</p>
<h3>Logs:</h3>
<p>In Discover you will see logs from the Chatbot app and be able to analyze the application logs and any specific log patterns, which saves you time in analysis.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/chatbot-reg-logs.png" alt="Chatbot-logs" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/chatbot-reg-logs-patterns.png" alt="Chatbot-log-patterns" /></p>
<h3>Traces:</h3>
<p>In Elastic Observability APM, you can also see the chatbot details, which include transactions, dependencies, logs, errors, etc.</p>
<p>When you look at traces, you will be able to see the chatbot interactions in the trace.</p>
<ol>
<li>
<p>You will see the end-to-end HTTP call</p>
</li>
<li>
<p>Individual calls to elasticsearch</p>
</li>
<li>
<p>Specific calls such as invoke actions, and calls to the LLM</p>
</li>
</ol>
<p>You can also get individual details of the traces, and look at logs and metrics related to that trace.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/chatbot-reg-trace.png" alt="CHatbot-traces" /></p>
<h3>Metrics:</h3>
<p>In addition to logs, and traces, any instrumented metrics will also get ingested into Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/chatbot-reg-metrics.png" alt="Chatbot app metrics" /></p>
<h1>Setting it all up with Docker</h1>
<p>To properly set up the Chatbot-app on Docker with telemetry sent to Elastic, a few things are needed:</p>
<ol>
<li>
<p>Git clone the chatbot-rag-app</p>
</li>
<li>
<p>Modify the env file as noted in the GitHub README, with the following exception:</p>
</li>
</ol>
<p>Use your Elastic Cloud's <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> and <code>OTEL_EXPORTER_OTLP_HEADERS</code> values instead.</p>
<p>You can find these in Elastic Cloud under <code>integrations-&gt;APM</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/otel-credentials.png" alt="OTel credentials" /></p>
<p>To send the OTel instrumentation, you will need the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://123456789.apm.us-west-2.aws.cloud.es.io:443&quot;
OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer%20xxxxx&quot;
</code></pre>
<p>Notice the <code>%20</code> in the headers. It is needed to encode the space between <code>Bearer</code> and the token.</p>
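<p>If you script your setup, you can build this header value with Python's standard library. This is just a sketch; <code>otlp_auth_header</code> is a hypothetical helper, not part of the app:</p>

```python
from urllib.parse import quote

def otlp_auth_header(token: str) -> str:
    # Percent-encode the space between "Bearer" and the token so the value
    # survives the OTLP headers parser, which splits on commas and spaces.
    return "Authorization=" + quote(f"Bearer {token}")

print(otlp_auth_header("xxxxx"))  # Authorization=Bearer%20xxxxx
```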
<ol start="3">
<li>
<p>Enable the OTel SDK by setting <code>OTEL_SDK_DISABLED=false</code></p>
</li>
<li>
<p>Set the envs for LLMs</p>
</li>
</ol>
<p>In this example we're using OpenAI, so only three variables are needed.</p>
<pre><code class="language-bash">LLM_TYPE=openai
OPENAI_API_KEY=XXXX
CHAT_MODEL=gpt-4o-mini
</code></pre>
<ol start="5">
<li>Run the docker container as noted</li>
</ol>
<pre><code class="language-bash">docker compose up --build --force-recreate
</code></pre>
<ol start="6">
<li>
<p>Play with the app at <code>localhost:4000</code></p>
</li>
<li>
<p>Then log into Elastic cloud and see the output as shown previously.</p>
</li>
</ol>
<h1>Run chatbot-rag-app on Kubernetes</h1>
<p>To set this up, follow the observability-examples repo, which contains the Kubernetes yaml files being used. These also point to Elastic Cloud.</p>
<ol>
<li>
<p>Set up the Kubernetes Cluster (we're using EKS)</p>
</li>
<li>
<p>Get the appropriate ENV variables:</p>
</li>
</ol>
<ul>
<li>
<p>Find the <code>OTEL_EXPORTER_OTLP_ENDPOINT/HEADERS</code> variables as noted in the previous section for Docker.</p>
</li>
<li>
<p>Get your OpenAI Key</p>
</li>
<li>
<p>Your Elasticsearch URL, username, and password.</p>
</li>
</ul>
<ol start="3">
<li>Follow the instructions in the following <a href="https://github.com/elastic/observability-examples/tree/main/chatbot-rag-app-observability">github repo in observability examples</a> to run two Kubernetes yaml files.</li>
</ol>
<p>Essentially, you only need to replace the secret variables in k8s-deployment.yaml and run:</p>
<pre><code class="language-bash">kubectl create -f k8s-deployment.yaml
kubectl create -f init-index-job.yaml
</code></pre>
<p>The app needs to be running first; the init job then uses the app to initialize Elasticsearch with the indices the app needs.</p>
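<p>One way to enforce that ordering is to wait for the Deployment to become Available before creating the Job. This is a sketch using the resource names from the manifests below; adjust names and timeouts for your cluster:</p>

```shell
# Apply the app first, and block until it is Available.
kubectl create -f k8s-deployment.yaml
kubectl wait deployment/chatbot-regular --for=condition=Available --timeout=180s

# Only then create the index-initialization job, and wait for it to finish.
kubectl create -f init-index-job.yaml
kubectl wait job/init-elasticsearch-index-test --for=condition=complete --timeout=300s
```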
<p><strong><em>Init-index-job.yaml</em></strong></p>
<pre><code class="language-yaml">apiVersion: batch/v1
kind: Job
metadata:
  name: init-elasticsearch-index-test
spec:
  template:
    spec:
      containers:
      - name: init-index
        image: ghcr.io/elastic/elasticsearch-labs/chatbot-rag-app:latest
        workingDir: /app/api
        command: [&quot;python3&quot;, &quot;-m&quot;, &quot;flask&quot;, &quot;--app&quot;, &quot;app&quot;, &quot;create-index&quot;]
        env:
        - name: FLASK_APP
          value: &quot;app&quot;
        - name: LLM_TYPE
          value: &quot;openai&quot;
        - name: CHAT_MODEL
          value: &quot;gpt-4o-mini&quot;
        - name: ES_INDEX
          value: &quot;workplace-app-docs&quot;
        - name: ES_INDEX_CHAT_HISTORY
          value: &quot;workplace-app-docs-chat-history&quot;
        - name: ELASTICSEARCH_URL
          valueFrom:
            secretKeyRef:
              name: chatbot-regular-secrets
              key: ELASTICSEARCH_URL
        - name: ELASTICSEARCH_USER
          valueFrom:
            secretKeyRef:
              name: chatbot-regular-secrets
              key: ELASTICSEARCH_USER
        - name: ELASTICSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: chatbot-regular-secrets
              key: ELASTICSEARCH_PASSWORD
        envFrom:
        - secretRef:
            name: chatbot-regular-secrets
      restartPolicy: Never
  backoffLimit: 4
</code></pre>
<p><strong><em>k8s-deployment.yaml</em></strong></p>
<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: chatbot-regular-secrets
type: Opaque
stringData:
  ELASTICSEARCH_URL: &quot;https://yourelasticcloud.es.us-west-2.aws.found.io&quot;
  ELASTICSEARCH_USER: &quot;elastic&quot;
  ELASTICSEARCH_PASSWORD: &quot;elastic&quot;
  OTEL_EXPORTER_OTLP_HEADERS: &quot;Authorization=Bearer%20xxxx&quot;
  OTEL_EXPORTER_OTLP_ENDPOINT: &quot;https://12345.apm.us-west-2.aws.cloud.es.io:443&quot;
  OPENAI_API_KEY: &quot;YYYYYYYY&quot;

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chatbot-regular
spec:
  replicas: 2
  selector:
    matchLabels:
      app: chatbot-regular
  template:
    metadata:
      labels:
        app: chatbot-regular
    spec:
      containers:
      - name: chatbot-regular
        image: ghcr.io/elastic/elasticsearch-labs/chatbot-rag-app:latest
        ports:
        - containerPort: 4000
        env:
        - name: LLM_TYPE
          value: &quot;openai&quot;
        - name: CHAT_MODEL
          value: &quot;gpt-4o-mini&quot;
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: &quot;service.name=chatbot-regular,service.version=0.0.1,deployment.environment=dev&quot;
        - name: OTEL_SDK_DISABLED
          value: &quot;false&quot;
        - name: OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT
          value: &quot;true&quot;
        - name: OTEL_EXPERIMENTAL_RESOURCE_DETECTORS
          value: &quot;process_runtime,os,otel,telemetry_distro&quot;
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          value: &quot;http/protobuf&quot;
        - name: OTEL_METRIC_EXPORT_INTERVAL
          value: &quot;3000&quot;
        - name: OTEL_BSP_SCHEDULE_DELAY
          value: &quot;3000&quot;
        envFrom:
        - secretRef:
            name: chatbot-regular-secrets
        resources:
          requests:
            memory: &quot;512Mi&quot;
            cpu: &quot;250m&quot;
          limits:
            memory: &quot;1Gi&quot;
            cpu: &quot;500m&quot;

---
apiVersion: v1
kind: Service
metadata:
  name: chatbot-regular-service
spec:
  selector:
    app: chatbot-regular
  ports:
  - port: 80
    targetPort: 4000
  type: LoadBalancer
</code></pre>
<p><strong>Open App with LoadBalancer URL</strong></p>
<p>Run the <code>kubectl get services</code> command and get the URL for the chatbot app:</p>
<pre><code class="language-bash">% kubectl get services
NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP                                        PORT(S)        AGE
chatbot-regular-service   LoadBalancer   10.100.130.44   xxxxxxxxx-1515488226.us-west-2.elb.amazonaws.com   80:30748/TCP   6d23h
</code></pre>
<ol start="4">
<li>
<p>Play with the app and review the telemetry in Elastic</p>
</li>
<li>
<p>Once you go to the URL, you should see all the screens described at the beginning of this blog.</p>
</li>
</ol>
<h1>Conclusion</h1>
<p>With Elastic's Chatbot-rag-app you have an example of how to build out an OpenAI-driven RAG based chat application. However, you still need to understand how well it performs, whether it's working properly, and so on. Using OTel and Elastic’s EDOT gives you the ability to achieve this. Additionally, you will generally run this application on Kubernetes. Hopefully this blog provides an outline of how to achieve this.</p>
<p>Here are the other tracing blogs:</p>
<p>App Observability with LLM (Tracing)-</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing-langtrace">Observing LangChain with Langtrace and OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-openlit-tracing">Observing LangChain with OpenLit Tracing</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing">Instrumenting LangChain with OpenTelemetry</a></p>
</li>
</ul>
<p>LLM Observability -</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elevate-llm-observability-with-gcp-vertex-ai-integration">Elevate LLM Observability with GCP Vertex AI Integration</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-aws-bedrock">LLM Observability on AWS Bedrock</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai">LLM Observability for Azure OpenAI</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai-v2">LLM Observability for Azure OpenAI v2</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/edot-openai-tracing.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Tracing a RAG based Chatbot with Elastic Distributions of OpenTelemetry and Langtrace]]></title>
            <link>https://www.elastic.co/observability-labs/blog/openai-tracing-langtrace-elastic</link>
            <guid isPermaLink="false">openai-tracing-langtrace-elastic</guid>
            <pubDate>Thu, 06 Feb 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[How to observe a OpenAI RAG based application using Elastic. Instrument the app, collect logs, traces, metrics, and understand how well the LLM is performing with Elastic Distributions of OpenTelemetry on Kubernetes with Langtrace.]]></description>
<content:encoded><![CDATA[<p>Most AI-driven applications currently focus on increasing the value an end user, such as an SRE, gets from AI. The main use case is the creation of various chatbots. These chatbots not only use large language models (LLMs), but also frameworks such as LangChain, and search to improve contextual information during a conversation (Retrieval Augmented Generation). Elastic’s sample <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app">RAG based Chatbot application</a> showcases how to use Elasticsearch with local data that has embeddings, enabling search to pull out the most contextual information during a query with a chatbot connected to an LLM of your choice. It's a great example of how to build out a RAG based application with Elasticsearch. However, what about monitoring the application?</p>
<p>Elastic provides the ability to ingest OpenTelemetry data with native OTel SDKs, the off-the-shelf OTel Collector, or Elastic’s Distributions of OpenTelemetry (EDOT). EDOT enables you to bring in logs, metrics, and traces for your GenAI application and for K8s. However, you will also generally need libraries to help trace specific components in your application. For tracing GenAI applications you can pick from a large set of libraries:</p>
<ul>
<li>
<p><a href="https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation-genai/opentelemetry-instrumentation-openai-v2">OpenTelemetry OpenAI Instrumentation-v2</a> - allows tracing LLM requests and logging of messages made by the OpenAI Python API library. (Note: v2 is built by OpenTelemetry; the non-v2 version is from a specific vendor, not OpenTelemetry.)</p>
</li>
<li>
<p><a href="https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation-genai/opentelemetry-instrumentation-vertexai">OpenTelemetry VertexAI Instrumentation</a> - allows tracing LLM requests and logging of messages made by the VertexAI Python API library</p>
</li>
<li>
<p><a href="https://docs.langtrace.ai/introduction">Langtrace</a> - a commercially available library that supports all major LLMs in one library; all traces are also OTel native.</p>
</li>
<li>
<p>Elastic’s EDOT - which recently added tracing. See this <a href="https://www.elastic.co/observability-labs/blog/openai-tracing-elastic-opentelemetry">blog</a>.</p>
</li>
</ul>
<p>As you can see, OpenTelemetry is becoming the de facto mechanism for collecting and ingesting this telemetry. Its support for GenAI is growing, but it is still early days.</p>
<p>In this blog, we will walk through how to, with minimal code, observe a RAG based chatbot application with tracing using Langtrace. We previously covered Langtrace in a <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing-langtrace">blog</a> to highlight tracing Langchain.</p>
<p>In this blog we use Langtrace, which supports OpenAI, Amazon Bedrock, Cohere, and others in one library.</p>
<h1>Pre-requisites:</h1>
<p>To follow along, you will need the following prerequisites:</p>
<ul>
<li>An Elastic Cloud account (sign up now), and familiarity with Elastic’s OpenTelemetry configuration. No specific version is required for Serverless; for regular Elastic Cloud, use 8.17 or later.</li>
<li>Git clone the <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app">RAG based Chatbot application</a> and go through the <a href="https://www.elastic.co/search-labs/tutorials/chatbot-tutorial/welcome">tutorial</a> on how to bring it up and become more familiar.</li>
<li>An account on your favorite LLM (OpenAI, Azure OpenAI, etc.), with API keys</li>
<li>Familiarity with EDOT, to understand how we bring in logs, metrics, and traces from the application through the OTel Collector</li>
<li>A Kubernetes cluster (I’ll be using Amazon EKS)</li>
<li>The <a href="https://docs.langtrace.ai/introduction">Langtrace</a> documentation, for reference</li>
</ul>
<h1>Application OpenTelemetry output in Elastic</h1>
<h2>Chatbot-rag-app</h2>
<p>The first item that you will need to get up and running is the ChatBotApp, and once up you should see the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-general.png" alt="Chatbot app main page" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-details.png" alt="Chatbot app working" /></p>
<p>As you select some of the questions, you will get a response based on the index that was created in Elasticsearch when the app initialized. Additionally, there will be queries made to LLMs.</p>
<h2>Traces, logs, and metrics from EDOT in Elastic</h2>
<p>Once you have OTel Collector with EDOT configuration on your K8s cluster, and Elastic Cloud up and running you should see the following:</p>
<h3>Logs:</h3>
<p>In Discover you will see logs from the Chatbot app and be able to analyze the application logs, any specific log patterns (which saves you time in analysis), and logs from K8s.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-logs.png" alt="Chatbot-logs" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-log-patterns.png" alt="Chatbot-log-patterns" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-logs-detailed.png" alt="Chatbot-log-details" /></p>
<h3>Traces:</h3>
<p>In Elastic Observability APM, you can also see the chatbot details, which include transactions, dependencies, logs, errors, etc.</p>
<p>When you look at traces, you will be able to see the chatbot interactions in the trace.</p>
<ol>
<li>
<p>You will see the end-to-end HTTP call</p>
</li>
<li>
<p>Individual calls to elasticsearch</p>
</li>
<li>
<p>Specific calls such as invoke actions, and calls to the LLM</p>
</li>
</ol>
<p>You can also get individual details of the traces, and look at logs and metrics related to that trace.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-service-traces.png" alt="CHatbot-traces" /></p>
<h3>Metrics:</h3>
<p>In addition to logs, and traces, any instrumented metrics will also get ingested into Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/chatbot-reg-metrics.png" alt="Chatbot app metrics" /></p>
<h1>Setting it all up</h1>
<p>To properly set up the Chatbot-app on K8s with telemetry sent to Elastic, a few things must be done:</p>
<ol>
<li>
<p>Git clone the chatbot-rag-app and modify one of the Python files.</p>
</li>
<li>
<p>Next, create a Docker container that can be used in Kubernetes. The <a href="https://github.com/elastic/elasticsearch-labs/blob/main/example-apps/chatbot-rag-app/Dockerfile">Dockerfile</a> in the Chatbot-app is good to use.</p>
</li>
<li>
<p>Collect all needed env variables. In this example we are using OpenAI, but the files can be modified for any of the LLMs. You will have to load a few environment variables into the cluster. In the GitHub repo there is an env.example for Docker. You can pick and choose what is needed and adjust appropriately in the K8s files below.</p>
</li>
<li>
<p>Set up your K8s Cluster, and then install the OpenTelemetry collector with the appropriate yaml file and credentials. This will help collect K8s cluster logs and metrics also.</p>
</li>
<li>
<p>Utilize the two yaml files listed below to ensure you can run it on Kubernetes.</p>
</li>
</ol>
<ul>
<li>
<p>Init-index-job.yaml - initializes the index in Elasticsearch with the local corporate information</p>
</li>
<li>
<p>k8s-deployment-chatbot-rag-app.yaml - initializes the application frontend and backend.</p>
</li>
</ul>
<ol start="6">
<li>
<p>Open the app on the load balancer URL against the chatbot-app service in K8s</p>
</li>
<li>
<p>Go to Elasticsearch and look at Discover for logs; go to APM, look for your chatbot-app, and review the traces; and finally, review the metrics.</p>
</li>
</ol>
<h2>Modify the code for tracing with Langtrace</h2>
<p>Once you download and untar the app, go to the chatbot-rag-app directory:</p>
<pre><code class="language-bash">curl https://codeload.github.com/elastic/elasticsearch-labs/tar.gz/main | 
tar -xz --strip=2 elasticsearch-labs-main/example-apps/chatbot-rag-app
cd elasticsearch-labs-main/example-apps/chatbot-rag-app
</code></pre>
<p>Next, open the <code>app.py</code> file in the <code>api</code> directory and add the following:</p>
<pre><code class="language-python">from opentelemetry.instrumentation.flask import FlaskInstrumentor

from langtrace_python_sdk import langtrace

langtrace.init(batch=False)

FlaskInstrumentor().instrument_app(app)
</code></pre>
<p>into the code:</p>
<pre><code class="language-python">import os
import sys
from uuid import uuid4

from chat import ask_question
from flask import Flask, Response, jsonify, request
from flask_cors import CORS

from opentelemetry.instrumentation.flask import FlaskInstrumentor

from langtrace_python_sdk import langtrace

langtrace.init(batch=False)

app = Flask(__name__, static_folder=&quot;../frontend/build&quot;, static_url_path=&quot;/&quot;)
CORS(app)

FlaskInstrumentor().instrument_app(app)

@app.route(&quot;/&quot;)
</code></pre>
<p>The added lines bring in the Langtrace library and the OpenTelemetry Flask instrumentation. This combination provides an end-to-end trace for the HTTP call, all the way down to the calls to Elasticsearch and to OpenAI (or other LLMs).</p>
<h2>Create the docker container</h2>
<p>Use the Dockerfile that is in the chatbot-rag-app directory as is, and add the following line:</p>
<p><code>RUN pip3 install --no-cache-dir langtrace-python-sdk</code></p>
<p>into the Dockerfile:</p>
<pre><code class="language-bash">COPY requirements.txt ./requirements.txt
RUN pip3 install -r ./requirements.txt
RUN pip3 install --no-cache-dir langtrace-python-sdk
COPY api ./api
COPY data ./data

EXPOSE 4000
</code></pre>
<p>This enables the <code>langtrace-python-sdk</code> to be installed into the docker container so the langtrace libraries can be used properly.</p>
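<p>Once the Dockerfile is updated, build and push the image to a registry your Kubernetes cluster can pull from. The tag below is a placeholder, not a published image; use your own registry path:</p>

```shell
# Build the instrumented image from the chatbot-rag-app directory,
# then push it so the K8s manifests below can reference it.
docker build -t your-image-location:latest .
docker push your-image-location:latest
```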
<h2>Collecting the proper env variables:</h2>
<p>First collect the env variables from Elastic:</p>
<p>Envs for index initialization in Elastic:</p>
<pre><code class="language-bash">
ELASTICSEARCH_URL=https://aws.us-west-2.aws.found.io
ELASTICSEARCH_USER=elastic
ELASTICSEARCH_PASSWORD=elastic

# The name of the Elasticsearch indexes
ES_INDEX=workplace-app-docs
ES_INDEX_CHAT_HISTORY=workplace-app-docs-chat-history

</code></pre>
<p>The <code>ELASTICSEARCH_URL</code> can be found in cloud.elastic.co when you bring up your instance. You will need to set up the user and password in Elastic.</p>
<p>To send the OTel instrumentation, you will need the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://123456789.apm.us-west-2.aws.cloud.es.io:443&quot;
OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer%20xxxxx&quot;
</code></pre>
<p>These credentials are found in Elastic under the APM integration, under OpenTelemetry. Note the <code>%20</code> encoding the space in the bearer token.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/otel-credentials.png" alt="OTel credentials" /></p>
<p>Envs for LLMs</p>
<p>In this example we’re using OpenAI, so only three variables are needed.</p>
<pre><code class="language-bash">LLM_TYPE=openai
OPENAI_API_KEY=XXXX
CHAT_MODEL=gpt-4o-mini
</code></pre>
<p>All of these variables will be needed in the Kubernetes yamls in the next step.</p>
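<p>Since a missing setting only surfaces at deploy time, it can help to fail fast before rendering these values into the Secret. This is a hypothetical pre-flight check; the <code>missing_vars</code> helper is not part of the app:</p>

```python
import os

# The settings consumed by the k8s-deployment and init-index manifests below.
REQUIRED = [
    "ELASTICSEARCH_URL", "ELASTICSEARCH_USER", "ELASTICSEARCH_PASSWORD",
    "OTEL_EXPORTER_OTLP_ENDPOINT", "OTEL_EXPORTER_OTLP_HEADERS",
    "LLM_TYPE", "OPENAI_API_KEY", "CHAT_MODEL",
]

def missing_vars(env=os.environ):
    # Report required settings that are unset or empty.
    return [name for name in REQUIRED if not env.get(name)]

if __name__ == "__main__":
    for name in missing_vars():
        print(f"missing: {name}")
```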
<h2>Setup K8s cluster and load up OTel Collector with EDOT</h2>
<p>This step is outlined in the following <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator">blog</a>. It’s a simple three-step process.</p>
<p>This step will bring in all the K8s cluster logs and metrics and setup the OTel collector.</p>
<h2>Setup secrets, initialize indices, and start the app</h2>
<p>Now that the cluster is up, and you have your environmental variables, you will need to</p>
<ol>
<li>
<p>Install and run <code>k8s-deployment.yaml</code> with the variables filled in</p>
</li>
<li>
<p>Initialize the index</p>
</li>
</ol>
<p>Essentially run the following:</p>
<pre><code class="language-bash">kubectl create -f k8s-deployment.yaml
kubectl create -f init-index-job.yaml
</code></pre>
<p>Here are the two yamls you should use. They are also found <a href="https://github.com/elastic/observability-examples/tree/main/chatbot-rag-app-observability">here</a>.</p>
<p>k8s-deployment.yaml</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: genai-chatbot-langtrace-secrets
type: Opaque
stringData:
  OTEL_EXPORTER_OTLP_HEADERS: &quot;Authorization=Bearer%20xxxx&quot;
  OTEL_EXPORTER_OTLP_ENDPOINT: &quot;https://1234567.apm.us-west-2.aws.cloud.es.io:443&quot;
  ELASTICSEARCH_URL: &quot;YOUR_ELASTIC_SEARCH_URL&quot;
  ELASTICSEARCH_USER: &quot;elastic&quot;
  ELASTICSEARCH_PASSWORD: &quot;elastic&quot;
  OPENAI_API_KEY: &quot;XXXXXXX&quot;  

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: genai-chatbot-langtrace
spec:
  replicas: 2
  selector:
    matchLabels:
      app: genai-chatbot-langtrace
  template:
    metadata:
      labels:
        app: genai-chatbot-langtrace
    spec:
      containers:
      - name: genai-chatbot-langtrace
        image: 65765.amazonaws.com/genai-chatbot-langtrace2:latest
        ports:
        - containerPort: 4000
        env:
        - name: LLM_TYPE
          value: &quot;openai&quot;
        - name: CHAT_MODEL
          value: &quot;gpt-4o-mini&quot;
        - name: OTEL_SDK_DISABLED
          value: &quot;false&quot;
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: &quot;service.name=genai-chatbot-langtrace,service.version=0.0.1,deployment.environment=dev&quot;
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          value: &quot;http/protobuf&quot;
        envFrom:
        - secretRef:
            name: genai-chatbot-langtrace-secrets
        resources:
          requests:
            memory: &quot;512Mi&quot;
            cpu: &quot;250m&quot;
          limits:
            memory: &quot;1Gi&quot;
            cpu: &quot;500m&quot;

---
apiVersion: v1
kind: Service
metadata:
  name: genai-chatbot-langtrace-service
spec:
  selector:
    app: genai-chatbot-langtrace
  ports:
  - port: 80
    targetPort: 4000
  type: LoadBalancer

</code></pre>
<p>Init-index-job.yaml</p>
<pre><code class="language-yaml">apiVersion: batch/v1
kind: Job
metadata:
  name: init-elasticsearch-index-test
spec:
  template:
    spec:
      containers:
      - name: init-index
#update your image location for chatbot rag app
        image: your-image-location:latest
        workingDir: /app/api
        command: [&quot;python3&quot;, &quot;-m&quot;, &quot;flask&quot;, &quot;--app&quot;, &quot;app&quot;, &quot;create-index&quot;]
        env:
        - name: FLASK_APP
          value: &quot;app&quot;
        - name: LLM_TYPE
          value: &quot;openai&quot;
        - name: CHAT_MODEL
          value: &quot;gpt-4o-mini&quot;
        - name: ES_INDEX
          value: &quot;workplace-app-docs&quot;
        - name: ES_INDEX_CHAT_HISTORY
          value: &quot;workplace-app-docs-chat-history&quot;
        - name: ELASTICSEARCH_URL
          valueFrom:
            secretKeyRef:
              name: genai-chatbot-langtrace-secrets
              key: ELASTICSEARCH_URL
        - name: ELASTICSEARCH_USER
          valueFrom:
            secretKeyRef:
              name: genai-chatbot-langtrace-secrets
              key: ELASTICSEARCH_USER
        - name: ELASTICSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: genai-chatbot-langtrace-secrets
              key: ELASTICSEARCH_PASSWORD
        envFrom:
        - secretRef:
            name: genai-chatbot-langtrace-secrets
      restartPolicy: Never
  backoffLimit: 4

</code></pre>
<h2>Open App with LoadBalancer URL</h2>
<p>Run the <code>kubectl get services</code> command and get the URL for the chatbot app:</p>
<pre><code class="language-bash">% kubectl get services
NAME                              TYPE           CLUSTER-IP      EXTERNAL-IP                                        PORT(S)        AGE
genai-chatbot-langtrace-service   LoadBalancer   10.100.130.44   xxxxxxxxx-1515488226.us-west-2.elb.amazonaws.com   80:30748/TCP   6d23h

</code></pre>
<p>Play with the app and review the telemetry in Elastic.</p>
<p>Once you go to the URL, you should see all the screens described at the beginning of this blog.</p>
<h1>Conclusion</h1>
<p>With Elastic's Chatbot-rag-app you have an example of how to build out an OpenAI-driven RAG based chat application. However, you still need to understand how well it performs, whether it's working properly, and so on. Using OTel, Elastic’s EDOT, and Langtrace gives you the ability to achieve this. Additionally, you will generally run this application on Kubernetes. Hopefully this blog provides an outline of how to achieve this.</p>
<p>Here are the other Tracing blogs:</p>
<p>App Observability with LLM (Tracing)-</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing-langtrace">Observing LangChain with Langtrace and OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-openlit-tracing">Observing LangChain with OpenLit Tracing</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing">Instrumenting LangChain with OpenTelemetry</a></p>
</li>
</ul>
<p>LLM Observability:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elevate-llm-observability-with-gcp-vertex-ai-integration">Elevate LLM Observability with GCP Vertex AI Integration</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-aws-bedrock">LLM Observability on AWS Bedrock</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai">LLM Observability for Azure OpenAI</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai-v2">LLM Observability for Azure OpenAI v2</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/edot-openai-tracing.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Collecting OpenShift container logs using Red Hat’s OpenShift Logging Operator]]></title>
            <link>https://www.elastic.co/observability-labs/blog/openshift-container-logs-red-hat-logging-operator</link>
            <guid isPermaLink="false">openshift-container-logs-red-hat-logging-operator</guid>
            <pubDate>Tue, 16 Jan 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to optimize OpenShift logs collected with Red Hat OpenShift Logging Operator, as well as format and route them efficiently in Elasticsearch.]]></description>
            <content:encoded><![CDATA[<p>This blog explores a possible approach to collecting and formatting OpenShift Container Platform logs and audit logs with Red Hat OpenShift Logging Operator. We recommend using Elastic® Agent for the best possible experience! We will also show how to format the logs to Elastic Common Schema (<a href="https://www.elastic.co/guide/en/ecs/current/index.html">ECS</a>) for the best experience viewing, searching, and visualizing your logs. All examples in this blog are based on OpenShift 4.14.</p>
<h2>Why use OpenShift Logging Operator?</h2>
<p>Many enterprise customers use OpenShift as their orchestration solution. The advantages of this approach are:</p>
<ul>
<li>
<p>It is developed and supported by Red Hat</p>
</li>
<li>
<p>It can automatically update the OpenShift cluster along with the operating system, ensuring they remain compatible</p>
</li>
<li>
<p>It can speed up development life cycles with features like source-to-image</p>
</li>
<li>
<p>It enforces enhanced security</p>
</li>
</ul>
<p>In our consulting experience, this latter aspect poses challenges and creates friction with OpenShift administrators when we try to install Elastic Agent to collect pod logs. Indeed, Elastic Agent requires the host's files to be mounted in the pod, and it also needs to run in privileged mode. (Read more about the permissions required by Elastic Agent in the <a href="https://www.elastic.co/guide/en/fleet/current/running-on-kubernetes-standalone.html#_red_hat_openshift_configuration">official Elasticsearch® Documentation</a>). While the solution we explore in this post requires similar privileges under the hood, it is managed by the OpenShift Logging Operator, which is developed and supported by Red Hat.</p>
<h2>Which logs are we going to collect?</h2>
<p>In OpenShift Container Platform, we distinguish <a href="https://docs.openshift.com/container-platform/4.14/logging/cluster-logging.html#logging-architecture-overview_cluster-logging">three broad categories of logs</a>: audit, application, and infrastructure logs:</p>
<ul>
<li>
<p><strong>Audit logs</strong> describe the list of activities that affected the system by users, administrators, and other components.</p>
</li>
<li>
<p><strong>Application logs</strong> are composed of the container logs of the pods running in non-reserved namespaces.</p>
</li>
<li>
<p><strong>Infrastructure logs</strong> are composed of container logs of the pods running in reserved namespaces like openshift*, kube*, and default along with journald messages from the nodes.</p>
</li>
</ul>
<p>For the sake of simplicity, we will consider only audit and application logs. We will describe how to format them into the shape expected by the Kubernetes integration so you can get the most out of Elastic Observability.</p>
<h2>Getting started</h2>
<p>To collect the logs from OpenShift, we must perform some preparation steps in Elasticsearch and OpenShift.</p>
<h3>Inside Elasticsearch</h3>
<p>We first <a href="https://www.elastic.co/guide/en/fleet/8.11/install-uninstall-integration-assets.html#install-integration-assets">install the Kubernetes integration assets</a>. We are mainly interested in the index templates and ingest pipelines for the logs-kubernetes.container_logs and logs-kubernetes.audit_logs data streams.</p>
<p>To format the logs received from the ClusterLogForwarder in <a href="https://www.elastic.co/guide/en/ecs/current/index.html">ECS</a> format, we will define a pipeline to normalize the container logs. The field naming convention used by OpenShift is slightly different from that used by ECS. To get a list of exported fields from OpenShift, refer to <a href="https://docs.openshift.com/container-platform/4.14/logging/cluster-logging-exported-fields.html">Exported fields | Logging | OpenShift Container Platform 4.14</a>. To get a list of exported fields of the Kubernetes integration, you can refer to <a href="https://www.elastic.co/guide/en/beats/filebeat/current/exported-fields-kubernetes-processor.html">Kubernetes fields | Filebeat Reference [8.11] | Elastic</a> and <a href="https://www.elastic.co/guide/en/observability/current/logs-app-fields.html">Logs app fields | Elastic Observability [8.11]</a>. Further, specific fields like kubernetes.annotations must be normalized by replacing dots with underscores. This operation is usually done automatically by Elastic Agent.</p>
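<p>As a quick illustration, the dots-to-underscores key sanitization that Elastic Agent normally performs, and that the Painless scripts in the pipeline replicate, can be sketched in Python (the function name is ours, for illustration only):</p>

```python
def sanitize_keys(fields):
    """Replace dots with underscores in map keys (e.g. kubernetes.annotations),
    mirroring the normalization Elastic Agent applies automatically."""
    return {k.replace(".", "_"): v for k, v in fields.items()}

# Example: an OpenShift-style annotation map becomes integration-friendly keys.
annotations = {"app.kubernetes.io/name": "nginx", "team": "obs"}
print(sanitize_keys(annotations))
```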
<pre><code class="language-bash">PUT _ingest/pipeline/openshift-2-ecs
{
  &quot;processors&quot;: [
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_id&quot;,
        &quot;target_field&quot;: &quot;kubernetes.pod.uid&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_ip&quot;,
        &quot;target_field&quot;: &quot;kubernetes.pod.ip&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_name&quot;,
        &quot;target_field&quot;: &quot;kubernetes.pod.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.namespace_name&quot;,
        &quot;target_field&quot;: &quot;kubernetes.namespace&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.namespace_id&quot;,
        &quot;target_field&quot;: &quot;kubernetes.namespace_uid&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.container_id&quot;,
        &quot;target_field&quot;: &quot;container.id&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;dissect&quot;: {
        &quot;field&quot;: &quot;container.id&quot;,
        &quot;pattern&quot;: &quot;%{container.runtime}://%{container.id}&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.container_image&quot;,
        &quot;target_field&quot;: &quot;container.image.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.container.image&quot;,
        &quot;copy_from&quot;: &quot;container.image.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;copy_from&quot;: &quot;kubernetes.container_name&quot;,
        &quot;field&quot;: &quot;container.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.container_name&quot;,
        &quot;target_field&quot;: &quot;kubernetes.container.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.node.name&quot;,
        &quot;copy_from&quot;: &quot;hostname&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;hostname&quot;,
        &quot;target_field&quot;: &quot;host.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;level&quot;,
        &quot;target_field&quot;: &quot;log.level&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;file&quot;,
        &quot;target_field&quot;: &quot;log.file.path&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;copy_from&quot;: &quot;openshift.cluster_id&quot;,
        &quot;field&quot;: &quot;orchestrator.cluster.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;dissect&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_owner&quot;,
        &quot;pattern&quot;: &quot;%{_tmp.parent_type}/%{_tmp.parent_name}&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;lowercase&quot;: {
        &quot;field&quot;: &quot;_tmp.parent_type&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod.{{_tmp.parent_type}}.name&quot;,
        &quot;value&quot;: &quot;{{_tmp.parent_name}}&quot;,
        &quot;if&quot;: &quot;ctx?._tmp?.parent_type != null&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;field&quot;: [
          &quot;_tmp&quot;,
          &quot;kubernetes.pod_owner&quot;
          ],
          &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;script&quot;: {
        &quot;description&quot;: &quot;Normalize kubernetes annotations&quot;,
        &quot;if&quot;: &quot;ctx?.kubernetes?.annotations != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
        def keys = new ArrayList(ctx.kubernetes.annotations.keySet());
        for(k in keys) {
          if (k.indexOf(&quot;.&quot;) &gt;= 0) {
            def sanitizedKey = k.replace(&quot;.&quot;, &quot;_&quot;);
            ctx.kubernetes.annotations[sanitizedKey] = ctx.kubernetes.annotations[k];
            ctx.kubernetes.annotations.remove(k);
          }
        }
        &quot;&quot;&quot;
      }
    },
    {
      &quot;script&quot;: {
        &quot;description&quot;: &quot;Normalize kubernetes namespace_labels&quot;,
        &quot;if&quot;: &quot;ctx?.kubernetes?.namespace_labels != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
        def keys = new ArrayList(ctx.kubernetes.namespace_labels.keySet());
        for(k in keys) {
          if (k.indexOf(&quot;.&quot;) &gt;= 0) {
            def sanitizedKey = k.replace(&quot;.&quot;, &quot;_&quot;);
            ctx.kubernetes.namespace_labels[sanitizedKey] = ctx.kubernetes.namespace_labels[k];
            ctx.kubernetes.namespace_labels.remove(k);
          }
        }
        &quot;&quot;&quot;
      }
    },
    {
      &quot;script&quot;: {
        &quot;description&quot;: &quot;Normalize special Kubernetes Labels used in logs-kubernetes.container_logs to determine service.name and service.version&quot;,
        &quot;if&quot;: &quot;ctx?.kubernetes?.labels != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
        def keys = new ArrayList(ctx.kubernetes.labels.keySet());
        for(k in keys) {
          if (k.startsWith(&quot;app_kubernetes_io_component_&quot;)) {
            def sanitizedKey = k.replace(&quot;app_kubernetes_io_component_&quot;, &quot;app_kubernetes_io_component/&quot;);
            ctx.kubernetes.labels[sanitizedKey] = ctx.kubernetes.labels[k];
            ctx.kubernetes.labels.remove(k);
          }
        }
        &quot;&quot;&quot;
      }
    }
    ]
}
</code></pre>
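<p>For instance, the dissect, lowercase, and set steps above turn the kubernetes.pod_owner field (e.g. <code>ReplicaSet/my-app</code>) into a <code>kubernetes.pod.&lt;kind&gt;.name</code> field. An illustrative Python sketch of that logic (not the pipeline itself):</p>

```python
def pod_owner_fields(pod_owner):
    """Mimic the dissect + lowercase + set processors:
    'Deployment/frontend' -> {'kubernetes.pod.deployment.name': 'frontend'}."""
    parent_type, parent_name = pod_owner.split("/", 1)
    return {f"kubernetes.pod.{parent_type.lower()}.name": parent_name}

print(pod_owner_fields("ReplicaSet/my-app-7c9d8"))
```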
<p>Similarly, to handle the audit logs like the ones collected by Kubernetes, we define an ingest pipeline:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/openshift-audit-2-ecs
{
  &quot;processors&quot;: [
    {
      &quot;script&quot;: {
        &quot;source&quot;: &quot;&quot;&quot;
        def audit = [:];
        def keyToRemove = [];
        for(k in ctx.keySet()) {
          if (k.indexOf('_') != 0 &amp;&amp; !['@timestamp', 'data_stream', 'openshift', 'event', 'hostname'].contains(k)) {
            audit[k] = ctx[k];
            keyToRemove.add(k);
          }
        }
        for(k in keyToRemove) {
          ctx.remove(k);
        }
        ctx.kubernetes=[&quot;audit&quot;:audit];
        &quot;&quot;&quot;,
        &quot;description&quot;: &quot;Move all top-level audit fields under the 'kubernetes.audit' object&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;copy_from&quot;: &quot;openshift.cluster_id&quot;,
        &quot;field&quot;: &quot;orchestrator.cluster.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.node.name&quot;,
        &quot;copy_from&quot;: &quot;hostname&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;hostname&quot;,
        &quot;target_field&quot;: &quot;host.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;script&quot;: {
        &quot;if&quot;: &quot;ctx?.kubernetes?.audit?.annotations != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
          def keys = new ArrayList(ctx.kubernetes.audit.annotations.keySet());
          for(k in keys) {
            if (k.indexOf(&quot;.&quot;) &gt;= 0) {
              def sanitizedKey = k.replace(&quot;.&quot;, &quot;_&quot;);
              ctx.kubernetes.audit.annotations[sanitizedKey] = ctx.kubernetes.audit.annotations[k];
              ctx.kubernetes.audit.annotations.remove(k);
            }
          }
          &quot;&quot;&quot;,
        &quot;description&quot;: &quot;Normalize kubernetes audit annotations field as expected by the Integration&quot;
      }
    }
  ]
}
</code></pre>
<p>The main objective of the pipeline is to mimic what Elastic Agent is doing: storing all audit fields under the kubernetes.audit object.</p>
<p>We are not going to use the conventional @custom pipeline approach because the fields must be normalized before invoking the logs-kubernetes.container_logs integration pipeline that uses fields like kubernetes.container.name and kubernetes.labels to determine the fields service.name and service.version. Read more about custom pipelines in <a href="https://www.elastic.co/guide/en/fleet/8.11/data-streams-pipeline-tutorial.html#data-streams-pipeline-one">Tutorial: Transform data with custom ingest pipelines | Fleet and Elastic Agent Guide [8.11]</a>.</p>
<p>The OpenShift Cluster Log Forwarder writes the data in the indices app-write and audit-write by default. It is possible to change this behavior, but it still tries to prepend the prefix “app” and the suffix “write”, so we opted to send the data to the default destination and use the reroute processor to send it to the right data streams. Read more about the Reroute Processor in our blog <a href="https://www.elastic.co/blog/simplifying-log-data-management-flexible-routing-elastic">Simplifying log data management: Harness the power of flexible routing with Elastic</a> and our documentation <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html">Reroute processor | Elasticsearch Guide [8.11] | Elastic</a>.</p>
<p>In this case, we want to redirect the container logs (app-write index) to logs-kubernetes.container_logs and the Audit logs (audit-write) to logs-kubernetes.audit_logs:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/app-write-reroute-pipeline
{
  &quot;processors&quot;: [
    {
      &quot;pipeline&quot;: {
        &quot;name&quot;: &quot;openshift-2-ecs&quot;,
        &quot;description&quot;: &quot;Format the Openshift data in ECS&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;event.dataset&quot;,
        &quot;value&quot;: &quot;kubernetes.container_logs&quot;
      }
    },
    {
      &quot;reroute&quot;: {
        &quot;destination&quot;: &quot;logs-kubernetes.container_logs-openshift&quot;
      }
    }
  ]
}



PUT _ingest/pipeline/audit-write-reroute-pipeline
{
  &quot;processors&quot;: [
    {
      &quot;pipeline&quot;: {
        &quot;name&quot;: &quot;openshift-audit-2-ecs&quot;,
        &quot;description&quot;: &quot;Format the Openshift data in ECS&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;event.dataset&quot;,
        &quot;value&quot;: &quot;kubernetes.audit_logs&quot;
      }
    },
    {
      &quot;reroute&quot;: {
        &quot;destination&quot;: &quot;logs-kubernetes.audit_logs-openshift&quot;
      }
    }
  ]
}
</code></pre>
<p>Note that because app-write and audit-write do not follow the data stream naming convention, we must explicitly set the destination field in the reroute processor. The reroute processor will also fill in the <a href="https://www.elastic.co/guide/en/ecs/8.11/ecs-data_stream.html">data_stream fields</a> for us. This step is done automatically by Elastic Agent at the source.</p>
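<p>The destinations follow the data stream naming convention <code>&lt;type&gt;-&lt;dataset&gt;-&lt;namespace&gt;</code>. A small, illustrative sketch of how such a name decomposes (assuming the namespace itself contains no dash):</p>

```python
def split_data_stream(name):
    """Decompose a data stream name into the data_stream.* fields the
    reroute processor fills in: type before the first dash, namespace
    after the last dash, dataset in between."""
    ds_type, rest = name.split("-", 1)
    dataset, namespace = rest.rsplit("-", 1)
    return {"type": ds_type, "dataset": dataset, "namespace": namespace}

print(split_data_stream("logs-kubernetes.container_logs-openshift"))
```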
<p>Next, we create the indices with the default pipelines we just defined, so that incoming logs are rerouted according to our needs.</p>
<pre><code class="language-bash">PUT app-write
{
  &quot;settings&quot;: {
      &quot;index.default_pipeline&quot;: &quot;app-write-reroute-pipeline&quot;
   }
}


PUT audit-write
{
  &quot;settings&quot;: {
    &quot;index.default_pipeline&quot;: &quot;audit-write-reroute-pipeline&quot;
  }
}
</code></pre>
<p>Basically, what we did can be summarized in this picture:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openshift-container-logs-red-hat-logging-operator/openshift-summary-blog.png" alt="openshift-summary-blog" /></p>
<p>Let us take the container logs. When the operator attempts to write to the app-write index, it invokes the default_pipeline “app-write-reroute-pipeline”, which formats the logs into ECS and reroutes them to the logs-kubernetes.container_logs-openshift data stream. This calls the integration pipeline, which in turn invokes, if it exists, the logs-kubernetes.container_logs@custom pipeline. Finally, the logs-kubernetes.container_logs pipeline may reroute the logs to another data set and namespace using the elastic.co/dataset and elastic.co/namespace annotations, as described in the Kubernetes <a href="https://docs.elastic.co/integrations/kubernetes/container-logs#rerouting-based-on-pod-annotations">integration documentation</a>, which can lead to the execution of yet another integration pipeline.</p>
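<p>The annotation-based rerouting step can be pictured with a small Python sketch. The key names are shown after the dots-to-underscores sanitization, and the defaults here are illustrative, matching the data stream used in this post:</p>

```python
def reroute_destination(annotations,
                        dataset="kubernetes.container_logs",
                        namespace="openshift"):
    """Pick the destination data stream from pod annotations, falling back
    to the current dataset/namespace when no annotation is present."""
    dataset = annotations.get("elastic_co/dataset", dataset)
    namespace = annotations.get("elastic_co/namespace", namespace)
    return f"logs-{dataset}-{namespace}"

# Without annotations, logs stay in the default data stream.
print(reroute_destination({}))
# With an annotation, they are rerouted to a different dataset.
print(reroute_destination({"elastic_co/dataset": "nginx.access"}))
```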
<h3>Create a user for sending the logs</h3>
<p>We are going to use basic authentication because, at the time of writing, it is the only supported authentication method for Elasticsearch in OpenShift logging. Thus, we need a role that allows the user to write and read the app-write and audit-write indices (required by the OpenShift agent), plus auto_configure access to logs-*-* to allow custom Kubernetes rerouting:</p>
<pre><code class="language-bash">PUT _security/role/YOURROLE
{
    &quot;cluster&quot;: [
      &quot;monitor&quot;
    ],
    &quot;indices&quot;: [
      {
        &quot;names&quot;: [
          &quot;logs-*-*&quot;
        ],
        &quot;privileges&quot;: [
          &quot;auto_configure&quot;,
          &quot;create_doc&quot;
        ],
        &quot;allow_restricted_indices&quot;: false
      },
      {
        &quot;names&quot;: [
          &quot;app-write&quot;,
          &quot;audit-write&quot;
        ],
        &quot;privileges&quot;: [
          &quot;create_doc&quot;,
          &quot;read&quot;
        ],
        &quot;allow_restricted_indices&quot;: false
      }
    ],
    &quot;applications&quot;: [],
    &quot;run_as&quot;: [],
    &quot;metadata&quot;: {},
    &quot;transient_metadata&quot;: {
      &quot;enabled&quot;: true
    }

}



PUT _security/user/YOUR_USERNAME
{
  &quot;password&quot;: &quot;YOUR_PASSWORD&quot;,
  &quot;roles&quot;: [&quot;YOURROLE&quot;]
}
</code></pre>
<h3>On OpenShift</h3>
<p>On the OpenShift Cluster, we need to follow the <a href="https://docs.openshift.com/container-platform/4.14/logging/log_collection_forwarding/log-forwarding.html">official documentation</a> of Red Hat on how to install the Red Hat OpenShift Logging and configure Cluster Logging and the Cluster Log Forwarder.</p>
<p>We need to install the Red Hat OpenShift Logging Operator, which defines the ClusterLogging and ClusterLogForwarder Resources. Afterward, we can define the Cluster Logging resource:</p>
<pre><code class="language-yaml">apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  collection:
    logs:
      type: vector
      vector: {}
</code></pre>
<p>The Cluster Log Forwarder is the resource responsible for defining a daemon set that forwards the logs to the remote Elasticsearch. Before creating it, we need to create a secret containing the Elasticsearch credentials of the user we created previously, in the same namespace where the ClusterLogForwarder will be deployed:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: elasticsearch-password
  namespace: openshift-logging
type: Opaque
stringData:
  username: YOUR_USERNAME
  password: YOUR_PASSWORD
</code></pre>
<p>Finally, we create the ClusterLogForwarder resource:</p>
<pre><code class="language-yaml">kind: ClusterLogForwarder
apiVersion: logging.openshift.io/v1
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: remote-elasticsearch
      secret:
        name: elasticsearch-password
      type: elasticsearch
      url: &quot;https://YOUR_ELASTICSEARCH_URL:443&quot;
      elasticsearch:
        version: 8 # The default is version 6 with the _type field
  pipelines:
    - inputRefs:
        - application
        - audit
      name: enable-default-log-store
      outputRefs:
        - remote-elasticsearch
</code></pre>
<p>Note that we explicitly set the Elasticsearch version to 8; otherwise, the ClusterLogForwarder would send the _type field, which is not compatible with Elasticsearch 8. Also note that we collect only application and audit logs.</p>
<h2>Result</h2>
<p>Once the logs are collected and passed through all the pipelines, the result is very close to the out-of-the-box Kubernetes integration. There are important differences, such as the lack of host and cloud metadata, which does not seem to be collected (at least not without additional configuration). We can view the Kubernetes container logs in the logs explorer:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openshift-container-logs-red-hat-logging-operator/openshift-summary-blog-graphs.png" alt="openshift-summary-blog-graphs" /></p>
<p>In this post, we described how you can use the OpenShift Logging Operator to collect container and audit logs. We still recommend leveraging Elastic Agent to collect all your logs: it is the best user experience you can get, with no need to maintain pipelines or transform the logs to ECS yourself. Additionally, Elastic Agent uses API keys for authentication and collects metadata, like cloud information, that allows you to do <a href="https://www.elastic.co/blog/optimize-cloud-resources-cost-apm-metadata-elastic-observability">more</a> in the long run.</p>
<p><a href="https://www.elastic.co/observability/log-monitoring">Learn more about log monitoring with the Elastic Stack</a>.</p>
<p><em>Have feedback on this blog?</em> <a href="https://github.com/herrBez/elastic-blog-openshift-logging/issues"><em>Share it here</em></a><em>.</em></p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/openshift-container-logs-red-hat-logging-operator/139687_-_Blog_Header_Banner_V1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Optimizing Observability with ES|QL: Streamlining SRE operations and issue resolution for Kubernetes and OTel]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-kubernetes-esql</link>
            <guid isPermaLink="false">opentelemetry-kubernetes-esql</guid>
            <pubDate>Wed, 01 Nov 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[ES|QL enhances operational efficiency, data analysis, and issue resolution for SREs. This blog covers the advantages of ES|QL in Elastic Observability and how it can apply to managing issues instrumented with OpenTelemetry and running on Kubernetes.]]></description>
            <content:encoded><![CDATA[<p>As an operations engineer (SRE, IT Operations, DevOps), managing technology and data sprawl is an ongoing challenge. Simply managing the large volumes of high dimensionality and high cardinality data is overwhelming.</p>
<p>As a single platform, Elastic® helps SREs unify and correlate limitless telemetry data, including metrics, logs, traces, and profiling, into a single datastore — Elasticsearch®. By then applying the power of Elastic’s advanced machine learning (ML), AIOps, AI Assistant, and analytics, you can break down silos and turn data into insights. As a full-stack observability solution, everything from infrastructure monitoring to log monitoring and application performance monitoring (APM) can be found in a single, unified experience.</p>
<p>In Elastic 8.11, a technical preview is now available of <a href="https://www.elastic.co/blog/esql-elasticsearch-piped-query-language">Elastic’s new piped query language, ES|QL (Elasticsearch Query Language)</a>, which transforms, enriches, and simplifies data investigations. Powered by a new query engine, ES|QL delivers advanced search capabilities with concurrent processing, improving speed and efficiency, irrespective of data source and structure. Accelerate resolution by creating aggregations and visualizations from one screen, delivering an iterative, uninterrupted workflow.</p>
<h2>Advantages of ES|QL for SREs</h2>
<p>SREs using Elastic Observability can leverage ES|QL to analyze logs, metrics, traces, and profiling data, enabling them to pinpoint performance bottlenecks and system issues with a single query. SREs gain the following advantages when managing high dimensionality and high cardinality data with ES|QL in Elastic Observability:</p>
<ul>
<li><strong>Improved operational efficiency:</strong> By using ES|QL, SREs can create more actionable notifications with aggregated values as thresholds from a single query, which can also be managed through the Elastic API and integrated into DevOps processes.</li>
<li><strong>Enhanced analysis with insights:</strong> ES|QL can process diverse observability data, including application, infrastructure, business data, and more, regardless of the source and structure. ES|QL can easily enrich the data with additional fields and context, allowing the creation of visualizations for dashboards or issue analysis with a single query.</li>
<li><strong>Reduced mean time to resolution:</strong> ES|QL, when combined with Elastic Observability's AIOps and AI Assistant, enhances detection accuracy by identifying trends, isolating incidents, and reducing false positives. This improvement in context facilitates troubleshooting and the quick pinpointing and resolution of issues.</li>
</ul>
<p>ES|QL in Elastic Observability not only enhances an SRE's ability to manage the customer experience, an organization's revenue, and SLOs more effectively but also facilitates collaboration with developers and DevOps by providing contextualized aggregated data.</p>
<p>In this blog, we will cover some of the key use cases SREs can leverage with ES|QL:</p>
<ul>
<li>ES|QL integrated with the Elastic AI Assistant, which uses public LLM and private data, enhances the analysis experience anywhere in Elastic Observability.</li>
<li>SREs can, in a single ES|QL query, break down, analyze, and visualize observability data from multiple sources and across any time frame.</li>
<li>Actionable alerts can be easily created from a single ES|QL query, enhancing operations.</li>
</ul>
<p>I will work through these use cases by showcasing how an SRE can solve a problem in an application instrumented with OpenTelemetry and running on Kubernetes. The OpenTelemetry (OTel) demo is on an Amazon EKS cluster, with Elastic Cloud 8.11 configured.</p>
<p>You can also check out our <a href="https://www.youtube.com/watch?v=vm0pBWI2l9c">Elastic Observability ES|QL Demo</a>, which walks through ES|QL functionality for Observability.</p>
<h2>ES|QL with AI Assistant</h2>
<p>As an SRE, you are monitoring your OTel instrumented application with Elastic Observability, and while in Elastic APM, you notice some issues highlighted in the service map.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-1-services.png" alt="1 - services" /></p>
<p>Using Elastic AI Assistant, you can easily ask for analysis, and in particular, we check on what the overall latency is across the application services.</p>
<pre><code class="language-plaintext">My APM data is in traces-apm*. What's the average latency per service over the last hour? Use ESQL, the data is mapped to ECS
</code></pre>
<p>The Elastic AI Assistant generates an ES|QL query, which we run in the AI Assistant to get a list of the average latencies across all the application services. We can easily see the top four are:</p>
<ul>
<li>load generator</li>
<li>front-end proxy</li>
<li>frontendservice</li>
<li>checkoutservice</li>
</ul>
<p>With a simple natural language query in the AI Assistant, it generated a single ES|QL query that helped list out the latencies across the services.</p>
<p>Noticing that there is an issue with several services, we decide to start with the frontend proxy. As we work through the details, we see significant failures, and through <strong>Elastic APM failure correlation</strong> , it becomes apparent that the frontend proxy is not properly completing its calls to downstream services.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-2-failed-transaction.png" alt="2 - failed transaction" /></p>
<h2>ES|QL insightful and contextual analysis in Discover</h2>
<p>Knowing that the application is running on Kubernetes, we investigate if there are issues in Kubernetes. In particular, we want to see if there are any services having issues.</p>
<p>We use the following query in ES|QL in Elastic Discover:</p>
<pre><code class="language-sql">from metrics-*
| where kubernetes.container.status.last_terminated_reason != &quot;&quot; and kubernetes.namespace == &quot;default&quot;
| stats reason_count=count(kubernetes.container.status.last_terminated_reason) by kubernetes.container.name, kubernetes.container.status.last_terminated_reason
| where reason_count &gt; 0
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-3-two-horizontal-bar-graphs.png" alt="3 - horizontal graph" /></p>
<p>ES|QL helps analyze thousands of metric events from Kubernetes and highlights two services that are restarting due to OOMKilled.</p>
<p>The Elastic AI Assistant, when asked about OOMKilled, indicates that a container in a pod was killed due to an out-of-memory condition.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-4-understanding-oomkilled.png" alt="4 - understanding oomkilled" /></p>
<p>We run another ES|QL query to understand the memory usage for emailservice and productcatalogservice.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-5-split-bar-graphs.png" alt="5 - split bar graphs" /></p>
<p>The ES|QL query shows that the average memory usage for both services is fairly high.</p>
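<p>For reference, a memory query of this shape yields that view (a sketch; the container names come from the demo application):</p>
<pre><code class="language-sql">FROM metrics-*
| WHERE kubernetes.container.name IN (&quot;emailservice&quot;, &quot;productcatalogservice&quot;)
| STATS avg_memory_pct = AVG(kubernetes.pod.memory.usage.limit.pct) BY kubernetes.container.name
</code></pre>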
<p>We can now further investigate both of these services’ logs, metrics, and Kubernetes-related data. However, before we continue, we create an alert to track heavy memory usage.</p>
<h2>Actionable alerts with ES|QL</h2>
<p>Suspecting that this specific issue might recur, we create an alert based on the ES|QL query we just ran, tracking any service that exceeds 50% memory utilization.</p>
<p>We modify the last query to find any service with high memory usage:</p>
<pre><code class="language-sql">FROM metrics*
| WHERE @timestamp &gt;= NOW() - 1 hours
| STATS avg_memory_usage = AVG(kubernetes.pod.memory.usage.limit.pct) BY kubernetes.deployment.name
| WHERE avg_memory_usage &gt; .5
</code></pre>
<p>With that query, we create a simple alert. Notice how the ES|QL query is brought into the alert. We connect this alert to PagerDuty, but we can choose from multiple connectors such as ServiceNow, Opsgenie, and email.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-6-create-rule.png" alt="6 - create rule" /></p>
<p>With this alert, we can now easily monitor for any services that exceed 50% memory utilization in their pods.</p>
<h2>Make the most of your data with ES|QL</h2>
<p>In this post, we demonstrated the power ES|QL brings to analysis, operations, and reducing MTTR. In summary, the three use cases with ES|QL in Elastic Observability are as follows:</p>
<ul>
<li>ES|QL integrated with the Elastic AI Assistant, which uses public LLM and private data, enhances the analysis experience anywhere in Elastic Observability.</li>
<li>SREs can, in a single ES|QL query, break down, analyze, and visualize observability data from multiple sources and across any time frame.</li>
<li>Actionable alerts can be easily created from a single ES|QL query, enhancing operations.</li>
</ul>
<p>Elastic invites SREs and developers to experience this transformative language firsthand and unlock new horizons in their data tasks. Try it today at <a href="https://ela.st/free-trial">https://ela.st/free-trial</a> now in technical preview.</p>
<blockquote>
<ul>
<li><a href="https://www.elastic.co/demo-gallery/observability">Elastic Observability Tour</a></li>
<li><a href="https://www.elastic.co/blog/log-management-observability-operations">The power of effective log management</a></li>
<li><a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Transforming Observability with the AI Assistant</a></li>
<li><a href="https://www.elastic.co/blog/esql-elasticsearch-piped-query-language">ES|QL announcement blog</a></li>
</ul>
</blockquote>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/ES_QL_blog-720x420-05.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Independence with OpenTelemetry on Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-observability</link>
            <guid isPermaLink="false">opentelemetry-observability</guid>
            <pubDate>Tue, 15 Nov 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[OpenTelemetry has become a key component for observability given its open standards and developer-friendly tools. See how easily Elastic Observability integrates with OTel to provide a platform that minimizes vendor lock-in and maximizes flexibility.]]></description>
<content:encoded><![CDATA[<p>The drive for faster, more scalable services is on the rise. Our day-to-day lives depend on apps: from food delivery apps that bring your favorite meal, to banking apps that manage your accounts, to apps that schedule doctor’s appointments. These apps need to grow not only from a feature standpoint but also in terms of user capacity. The scale and need for global reach drive increasing complexity for these high-demand cloud applications.</p>
<p>In order to keep pace with demand, most of these online apps and services (for example, mobile applications, web pages, SaaS) are moving to a distributed microservice-based architecture and Kubernetes. Once you’ve migrated your app to the cloud, how do you manage and monitor production, scale, and availability of the service? <a href="https://opentelemetry.io/">OpenTelemetry</a> is quickly becoming the de facto standard for instrumentation and collecting application telemetry data for Kubernetes applications.</p>
<p><a href="https://www.elastic.co/what-is/opentelemetry">OpenTelemetry (OTel)</a> is an open source project providing a collection of tools, APIs, and SDKs that can be used to generate, collect, and export telemetry data (metrics, logs, and traces) to understand software performance and behavior. OpenTelemetry recently became a CNCF incubating project and has a significant amount of growing community and vendor support.</p>
<p>While OTel provides a standard way to instrument applications with a standard telemetry format, it doesn’t provide any backend or analytics components. Using OTel libraries in applications, infrastructure, and user experience monitoring therefore gives you the flexibility to choose the <a href="https://www.elastic.co/observability">observability tool</a> that fits your needs. There is no longer any vendor lock-in for application performance monitoring (APM).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-otel-1.png" alt="" /></p>
<p>Elastic Observability natively supports OpenTelemetry and its OpenTelemetry protocol (OTLP) to ingest traces, metrics, and logs. All of Elastic Observability’s APM capabilities are available with OTel data. Hence the following capabilities (and more) are available for OTel data:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services</li>
<li>Transactions (traces)</li>
<li>ML correlations (specifically for latency)</li>
<li>Service logs</li>
</ul>
<p>In addition to Elastic’s APM and unified view of the telemetry data, you can now use Elastic’s powerful machine learning capabilities to accelerate analysis and alerting, helping reduce MTTR.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-otel-2.png" alt="" /></p>
<p>Given its open source heritage, Elastic also supports other CNCF projects, such as Prometheus, Fluentd, Fluent Bit, Istio, Kubernetes (K8s), and many more.</p>
<p>This blog will show:</p>
<ul>
<li>How to get a popular OTel instrumented demo app (Hipster Shop) configured to ingest into <a href="http://cloud.elastic.co">Elastic Cloud</a> through a few easy steps</li>
<li>Some of the Elastic APM capabilities and features around OTel data, and what you can do with this data once it’s in Elastic</li>
</ul>
<p>In follow-up blogs, we will detail how to use Elastic’s machine learning with OTel telemetry data, how to instrument OTel application metrics for specific languages, how we can support Prometheus ingest through the OTel collector, and more. Stay tuned!</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up the configuration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</li>
<li>We used the OpenTelemetry Demo. Directions for using Elastic with OpenTelemetry Demo are <a href="https://github.com/elastic/opentelemetry-demo">here</a>.</li>
<li>Make sure you have <a href="https://kubernetes.io/docs/reference/kubectl/">kubectl</a> and <a href="https://helm.sh/">helm</a> also installed locally.</li>
<li>Additionally, we are using an OTel manually instrumented version of the application. No OTel automatic instrumentation was used in this blog configuration.</li>
<li>Cluster location: while we used Google Kubernetes Engine (GKE), you can use any Kubernetes platform of your choice.</li>
<li>While Elastic can ingest telemetry directly from OTel instrumented services, we will focus on the more traditional deployment, which uses the OpenTelemetry Collector.</li>
<li>Prometheus and Fluentd/Fluent Bit, traditionally used to pull Kubernetes data, are not used here; follow-up blogs will showcase this setup.</li>
</ul>
<p>Here is the configuration we will get set up in this blog:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-otel-3.png" alt="Configuration to ingest OpenTelemetry data used in this blog" /></p>
<h2>Setting it all up</h2>
<p>Over the next few steps, I’ll walk through setting up an <a href="https://www.elastic.co/observability/opentelemetry">OpenTelemetry visualization</a>:</p>
<ul>
<li>Getting an account on Elastic Cloud</li>
<li>Bringing up a GKE cluster</li>
<li>Bringing up the application</li>
<li>Configuring Kubernetes OTel Collector configmap to point to Elastic Cloud</li>
<li>Using Elastic Observability APM with OTel data for improved visibility</li>
</ul>
<h3>Step 0: Create an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-otel-4.png" alt="" /></p>
<h3>Step 1: Bring up a K8S cluster</h3>
<p>We used Google Kubernetes Engine (GKE), but you can use any Kubernetes platform of your choice.</p>
<p>There are no special requirements for Elastic to collect OpenTelemetry data from a Kubernetes cluster. Any normal Kubernetes cluster on GKE, EKS, AKS, or Kubernetes compliant cluster (self-deployed and managed) works.</p>
<h3>Step 2: Load the OpenTelemetry demo application on the cluster</h3>
<p>Get your application on a Kubernetes cluster in your cloud service of choice or local Kubernetes platform. The application I am using is available <a href="https://github.com/bshetti/opentelemetry-microservices-demo/tree/main/deploy-with-collector-k8s">here</a>.</p>
<p>First clone the directory locally:</p>
<pre><code class="language-bash">git clone https://github.com/elastic/opentelemetry-demo.git
</code></pre>
<p>(Make sure you have <a href="https://kubernetes.io/docs/reference/kubectl/">kubectl</a> and <a href="https://helm.sh/">helm</a> also installed locally.)</p>
<p>The instructions utilize a specific opentelemetry-collector configuration for Elastic. Essentially, the Elastic <a href="https://github.com/elastic/opentelemetry-demo/blob/main/kubernetes/elastic-helm/values.yaml">values.yaml</a> file in the elastic/opentelemetry-demo repository configures the opentelemetry-collector to point to the Elastic APM Server using two main values:</p>
<ul>
<li>OTEL_EXPORTER_OTLP_ENDPOINT: Elastic’s APM Server endpoint</li>
<li>OTEL_EXPORTER_OTLP_HEADERS: the Elastic authorization header</li>
</ul>
<p>These two values can be found in the OpenTelemetry setup instructions under the APM integration instructions (Integrations-&gt;APM) in your Elastic cloud.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-apm-agents.png" alt="elastic apm agents" /></p>
<p>Once you obtain these, the first step is to create a secret on the cluster with your Elastic APM server endpoint and your APM secret token, using the following command:</p>
<pre><code class="language-bash">kubectl create secret generic elastic-secret \
  --from-literal=elastic_apm_endpoint='YOUR_APM_ENDPOINT_WITHOUT_HTTPS_PREFIX' \
  --from-literal=elastic_apm_secret_token='YOUR_APM_SECRET_TOKEN'
</code></pre>
<p>Don't forget to replace:</p>
<ul>
<li>YOUR_APM_ENDPOINT_WITHOUT_HTTPS_PREFIX: your Elastic APM endpoint ( <strong>without the https:// prefix</strong> ), used as OTEL_EXPORTER_OTLP_ENDPOINT</li>
<li>YOUR_APM_SECRET_TOKEN: your Elastic APM secret token, used in OTEL_EXPORTER_OTLP_HEADERS</li>
</ul>
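<p>Behind the scenes, the Helm values wire these two settings into the collector’s OTLP exporter. The relevant fragment looks roughly like this (a sketch; the exporter name and environment variable references are illustrative):</p>
<pre><code class="language-yaml">exporters:
  otlp/elastic:
    # Elastic APM Server endpoint (OTEL_EXPORTER_OTLP_ENDPOINT)
    endpoint: &quot;${ELASTIC_APM_ENDPOINT}&quot;
    headers:
      # Elastic authorization (OTEL_EXPORTER_OTLP_HEADERS)
      Authorization: &quot;Bearer ${ELASTIC_APM_SECRET_TOKEN}&quot;
</code></pre>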
<p>Now execute the following commands:</p>
<pre><code class="language-bash"># switch to the kubernetes/elastic-helm directory
cd kubernetes/elastic-helm

# add the open-telemetry Helm repository
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts

# deploy the demo through helm install
helm install -f values.yaml my-otel-demo open-telemetry/opentelemetry-demo
</code></pre>
<p>Once your application is up on Kubernetes, you will have the following pods (or some variant) running on the <strong>default</strong> namespace.</p>
<pre><code class="language-bash">kubectl get pods -n default
</code></pre>
<p>Output should be similar to the following:</p>
<pre><code class="language-bash">NAME                                                  READY   STATUS    RESTARTS      AGE
my-otel-demo-accountingservice-5c77754b4f-vwph6       1/1     Running   0             5d4h
my-otel-demo-adservice-6b8b7c7dc5-mb7j5               1/1     Running   0             5d4h
my-otel-demo-cartservice-76d94b7dcd-2g4lf             1/1     Running   0             5d4h
my-otel-demo-checkoutservice-988bbdb88-hmkrp          1/1     Running   0             5d4h
my-otel-demo-currencyservice-6cf4b5f9f6-vz9t2         1/1     Running   0             5d4h
my-otel-demo-emailservice-868c98fd4b-lpr7n            1/1     Running   6 (18h ago)   5d4h
my-otel-demo-featureflagservice-8446ff9c94-lzd4w      1/1     Running   0             5d4h
my-otel-demo-ffspostgres-867945d9cf-zzwd7             1/1     Running   0             5d4h
my-otel-demo-frauddetectionservice-5c97c589b9-z8fhz   1/1     Running   0             5d4h
my-otel-demo-frontend-d85ccf677-zg9fp                 1/1     Running   0             5d4h
my-otel-demo-frontendproxy-6c5c4fccf6-qmldp           1/1     Running   0             5d4h
my-otel-demo-kafka-68bcc66794-dsbr6                   1/1     Running   0             5d4h
my-otel-demo-loadgenerator-64c545b974-xfccq           1/1     Running   1 (36h ago)   5d4h
my-otel-demo-otelcol-fdfd9c7cf-6lr2w                  1/1     Running   0             5d4h
my-otel-demo-paymentservice-7955c68859-ff7zg          1/1     Running   0             5d4h
my-otel-demo-productcatalogservice-67c879657b-wn2wj   1/1     Running   0             5d4h
my-otel-demo-quoteservice-748d754ffc-qcwm4            1/1     Running   0             5d4h
my-otel-demo-recommendationservice-df78894c7-lwm5v    1/1     Running   0             5d4h
my-otel-demo-redis-7d48567546-h4p4t                   1/1     Running   0             5d4h
my-otel-demo-shippingservice-f6fc76ddd-2v7qv          1/1     Running   0             5d4h
</code></pre>
<h3>Step 3: Open Kibana and use the APM Service Map to view your OTel instrumented Services</h3>
<p>In the Elastic Observability UI under APM, select the service map to see your services.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-APM.png" alt="elastic observability APM" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-OTEL-service-map.png" alt="elastic observability OTEL service map" /></p>
<p>If you are seeing this, then the OpenTelemetry Collector is sending data into Elastic:</p>
<p><em>Congratulations, you’ve instrumented the OpenTelemetry demo application and successfully ingested the telemetry data into Elastic!</em></p>
<h3>Step 4: What can Elastic show me?</h3>
<p>Now that the OpenTelemetry data is ingested into Elastic, what can you do?</p>
<p>First, you can view the APM service map (as shown in the previous step) — this will give you a full view of all the services and the transaction flows between services.</p>
<p>Next, you can now check out individual services and the transactions being collected.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-frontend-overview.png" alt="elastic observability frontend overview" /></p>
<p>As you can see, the frontend details are listed. Everything from:</p>
<ul>
<li>Average service latency</li>
<li>Throughput</li>
<li>Main transactions</li>
<li>Failed transaction rate</li>
<li>Errors</li>
<li>Dependencies</li>
</ul>
<p>Let’s get to the trace. In the Transactions tab, you can review all the types of transactions related to the frontend service:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-frontend-transactions.png" alt="elastic observability frontend transactions" /></p>
<p>Selecting the HTTP POST transaction, we can see the full trace with all the spans:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-frontend-HTTP-POST.png" alt="Average latency for this transaction, throughput, any failures, and of course the trace!" /></p>
<p>Not only can you review the trace, but you can also analyze what is contributing to higher than normal latency for HTTP POST.</p>
<p>Elastic uses machine learning to help identify any potential latency issues across the services from the trace. It’s as simple as selecting the Latency Correlations tab and running the correlation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-latency-correlations.png" alt="elastic observability latency correlations" /></p>
<p>This shows that the high latency transactions are occurring in checkout service with a medium correlation.</p>
<p>You can then drill down into logs directly from the trace view and review the logs associated with the trace to help identify and pinpoint potential issues.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-latency-distribution.png" alt="elastic observability latency distribution" /></p>
<h3>Analyze your data with Elastic machine learning (ML)</h3>
<p>Once OpenTelemetry metrics are in Elastic, start analyzing your data through Elastic’s ML capabilities.</p>
<p>A great review of these features can be found here: <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">Correlating APM telemetry to determine root causes in transactions</a>. And there are many more videos and blogs on <a href="https://www.elastic.co/blog/">Elastic’s Blog</a>. We’ll follow up with additional blogs on leveraging Elastic’s machine learning capabilities for OpenTelemetry data.</p>
<h2>Conclusion</h2>
<p>I hope you’ve gotten an appreciation for how Elastic Observability can help you ingest and analyze OpenTelemetry data with Elastic’s APM capabilities.</p>
<p>A quick recap of what we covered:</p>
<ul>
<li>How to get a popular OTel instrumented demo app (Hipster Shop) configured to ingest into <a href="http://cloud.elastic.co">Elastic Cloud</a>, through a few easy steps</li>
<li>Some of the Elastic APM capabilities and features around OTel data, and what you can do with this data once it’s in Elastic</li>
</ul>
<p>Ready to get started? Sign up <a href="https://cloud.elastic.co/registration">for Elastic Cloud</a> and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your OpenTelemetry data.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/illustration-scalability-gear-1680x980_(1).jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Build better Service Level Objectives (SLOs) from logs and metrics]]></title>
            <link>https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics</link>
            <guid isPermaLink="false">service-level-objectives-slos-logs-metrics</guid>
            <pubDate>Fri, 23 Feb 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[To help manage operations and business metrics, Elastic Observability's SLO (Service Level Objectives) feature was introduced in 8.12. This blog reviews this feature and how you can use it with Elastic's AI Assistant to meet SLOs.]]></description>
            <content:encoded><![CDATA[<p>In today's digital landscape, applications are at the heart of both our personal and professional lives. We've grown accustomed to these applications being perpetually available and responsive. This expectation places a significant burden on the shoulders of developers and operations teams.</p>
<p>Site reliability engineers (SREs) face the challenging task of sifting through vast quantities of data, not just from the applications themselves but also from the underlying infrastructure. In addition to data analysis, they are responsible for ensuring the effective use and development of operational tools. The growing volume of data, the daily resolution of issues, and the continuous evolution of tools and processes can detract from the focus on business performance.</p>
<p>Elastic Observability offers a solution to this challenge. It enables SREs to integrate and examine all telemetry data (logs, metrics, traces, and profiling) in conjunction with business metrics. This comprehensive approach to data analysis fosters operational excellence, boosts productivity, and yields critical insights, all of which are integral to maintaining high-performing applications in a demanding digital environment.</p>
<p>To help manage operations and business metrics, Elastic Observability's SLO (Service Level Objectives) feature was introduced in <a href="https://www.elastic.co/guide/en/observability/8.12/slo.html">8.12</a>. This feature enables setting measurable performance targets for services, such as <a href="https://sre.google/sre-book/monitoring-distributed-systems/">availability, latency, traffic, errors, and saturation or define your own</a>. Key components include:</p>
<ul>
<li>
<p>Defining and monitoring SLIs (Service Level Indicators)</p>
</li>
<li>
<p>Monitoring error budgets indicating permissible performance shortfalls</p>
</li>
<li>
<p>Alerting on burn rates showing error budget consumption</p>
</li>
</ul>
<p>Users can monitor SLOs in real-time with dashboards, track historical performance, and receive alerts for potential issues. Additionally, SLO dashboard panels offer customized visualizations.</p>
<p>Service Level Objectives (SLOs) are generally available for our Platinum and Enterprise subscription customers.</p>
<p>In this blog, we will outline the following:</p>
<ul>
<li>
<p>What are SLOs? A Google SRE perspective</p>
</li>
<li>
<p>Several scenarios of defining and managing SLOs</p>
</li>
</ul>
<h2>Service Level Objective overview</h2>
<p>Service Level Objectives (SLOs) are a crucial component for Site Reliability Engineering (SRE), as detailed in <a href="https://sre.google/sre-book/table-of-contents/">Google's SRE Handbook</a>. They provide a framework for quantifying and managing the reliability of a service. The key elements of SLOs include:</p>
<ul>
<li>
<p><strong>Service Level Indicators (SLIs):</strong> These are carefully selected metrics, such as uptime, latency, throughput, or error rates, that represent aspects of the service that matter from an operations or business perspective. An SLI is a measure of the service level provided (latency, uptime, etc.), defined as a ratio of good events over total events, ranging between 0% and 100%.</p>
</li>
<li>
<p><strong>Service Level Objective (SLO):</strong> An SLO is the target value for a service level, measured as a percentage by an SLI. Above the threshold, the service is compliant. For example, if we use service availability as an SLI with a target of 99.9% successful responses, then any time failed responses exceed 0.1%, the SLO is out of compliance.</p>
</li>
<li>
<p><strong>Error budget:</strong> This represents the threshold of acceptable errors, balancing the need for reliability with practical limits. It is defined as 100% minus the SLO: the quantity of errors that can be tolerated.</p>
</li>
<li>
<p><strong>Burn rate:</strong> This concept relates to how quickly the service is consuming its error budget, which is the acceptable threshold for unreliability agreed upon by the service providers and its users.</p>
</li>
</ul>
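<p>As a concrete illustration of how these quantities relate (the numbers are illustrative):</p>
<pre><code class="language-plaintext">SLO target   : 99.9% availability over a 30-day window
Error budget : 100% - 99.9% = 0.1%, about 43.2 minutes of downtime in 30 days
Burn rate    : a burn rate of 2 consumes the budget twice as fast as allowed,
               exhausting it in about 15 days instead of 30
</code></pre>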
<p>Understanding these concepts and effectively implementing them is essential for maintaining a balance between innovation and reliability in service delivery. For more detailed information, you can refer to <a href="https://sre.google/workbook/slo-document/">Google's SRE Handbook</a>.</p>
<p>One main thing to remember is that SLO monitoring is <em>not</em> incident monitoring. SLO monitoring is a proactive, strategic approach designed to ensure that services meet established performance standards and user expectations. It involves tracking Service Level Objectives, error budgets, and the overall reliability of a service over time. This predictive method helps in preventing issues that could impact users and aligns service performance with business objectives.</p>
<p>In contrast, incident monitoring is a reactive process focused on detecting, responding to, and mitigating service incidents as they occur. It aims to address unexpected disruptions or failures in real time, minimizing downtime and impact on service. This includes monitoring system health, errors, and response times during incidents, with a focus on rapid response to minimize disruption and preserve the service's reputation.</p>
<p>Elastic®’s SLO capability is based directly on the Google SRE Handbook, using the definitions and semantics described there. Users can perform the following with SLOs in Elastic:</p>
<ul>
<li>
<p>Define an SLO on an SLI such as a KQL (log-based) query, service availability, service latency, a custom metric, a histogram metric, or a timeslice metric. Additionally, set the appropriate threshold.</p>
</li>
<li>
<p>Utilize occurrence versus timeslice based budgeting. Occurrence budgeting computes the number of good events over the number of total events. Timeslice budgeting breaks the overall time window into smaller slices of a defined duration and computes the number of good slices over the total slices. Timeslice targets are more accurate and useful when calculating something like a service’s SLO against agreed upon customer targets.</p>
</li>
<li>
<p>Manage all the SLOs in a singular location.</p>
</li>
<li>
<p>Trigger alerts from the defined SLO, whether the SLI drops below target, the error budget is exhausted, or the burn rate exceeds a threshold.</p>
</li>
<li>
<p>Create unique service level dashboards with SLO information for a more comprehensive view of the service.</p>
</li>
</ul>
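<p>To illustrate the difference between the two budgeting methods (the numbers are illustrative):</p>
<pre><code class="language-plaintext">Occurrences : 995,000 good events / 1,000,000 total events = 99.5% SLI
Timeslices  : a 24-hour window split into 5-minute slices gives 288 slices;
              285 good slices / 288 total slices = ~98.96% SLI
</code></pre>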
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/1-slo-blog.png" alt="Create alerts" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/2-slo-blog.png" alt="Create dashboards" /></p>
<p>SREs need to be able to manage business metrics.</p>
<h2>SLOs based on logs: NGINX availability</h2>
<p>Defining SLOs does not always mean metrics need to be used. Logs are a rich form of information, even when they have metrics embedded in them. Hence it’s useful to understand your business and operations status based on logs.</p>
<p>Elastic allows you to create an SLO based on specific fields in the log message, which don’t have to be metrics. A simple example is a multi-tier app that has a web server layer (NGINX), a processing layer, and a database layer.</p>
<p>Let’s say that your processing layer is managing a significant number of requests, and you want to ensure that the service stays up. The best way is to ensure that all http.response.status_code values are less than 500. Any status code below 500 indicates the service is up, and any errors (like 404) are user or client errors rather than server errors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/3-slo-blog.png" alt="expanded document" /></p>
<p>If we use Discover in Elastic, we see that there are close to 2M log messages over a seven-day time frame.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/4-slo-blog.png" alt="17k" /></p>
<p>Additionally, the number of messages with http.response.status_code &gt;= 500 is minimal, around 17K.</p>
<p>Rather than creating an alert, we can create an SLO with this query:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/5-slo-blog.png" alt="edit SLO" /></p>
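<p>The good and total event filters in such a definition can be expressed in KQL roughly as follows (a sketch; field names follow ECS):</p>
<pre><code class="language-plaintext">Good events query : http.response.status_code &lt; 500
Total events query: http.response.status_code : *
</code></pre>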
<p>We chose to use occurrences as the budgeting method to keep things simple.</p>
<p>Once defined, we can see how well our SLO is performing over a seven-day time frame. We can see not only the SLO, but also the burn rate, the historical SLI, the error budget, and any specific alerts against the SLO.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/6-slo-blog.png" alt="SLOs" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/7-slo-blog.png" alt="nginx server availability " /></p>
<p>Not only do we get information about the violation, but we also get:</p>
<ul>
<li>
<p>Historical SLI (7 days)</p>
</li>
<li>
<p>Error budget burn down</p>
</li>
<li>
<p>Good vs. bad events (24 hours)</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/8-slo-blog.png" alt="Percentages" /></p>
<p>We can see how we’ve easily burned through our error budget.</p>
<p>Something must be going on with NGINX. To investigate, all we need to do is use the <a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">AI Assistant</a> and its natural language interface to ask questions that help analyze the situation.</p>
<p>Let’s use Elastic’s AI Assistant to analyze the breakdown of http.response.status_code across all the logs from the past seven days. This helps us understand how many 50X errors we are getting.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/9-slo-blog.png" alt="count of http response status code" /></p>
<p>As we can see, the number of 502s is small compared to the overall number of messages, but it is still affecting our SLO.</p>
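<p>The breakdown the Assistant produces can also be reproduced by hand. Assuming the nginx logs live in a data stream matching logs-nginx.access-* (an assumption; substitute your own index pattern), a terms aggregation in Dev Tools would look like:</p>
<pre><code class="language-bash">GET logs-nginx.access-*/_search
{
  &quot;size&quot;: 0,
  &quot;query&quot;: { &quot;range&quot;: { &quot;@timestamp&quot;: { &quot;gte&quot;: &quot;now-7d&quot; } } },
  &quot;aggs&quot;: {
    &quot;status_codes&quot;: {
      &quot;terms&quot;: { &quot;field&quot;: &quot;http.response.status_code&quot; }
    }
  }
}
</code></pre>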
<p>It seems nginx is having an issue. To help resolve it, we also ask the AI Assistant how to address this error. Specifically, we ask if there is an internal runbook the SRE team has created.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/10-slo-blog.png" alt="ai assistant thread" /></p>
<p>The AI Assistant retrieves a runbook the team has added to its knowledge base. We can now follow it to analyze and try to resolve or mitigate the nginx issue.</p>
<p>While this is a simple example, there are an endless number of possibilities that can be defined based on KQL. Some other simple examples:</p>
<ul>
<li>
<p>99% of requests occur under 200ms</p>
</li>
<li>
<p>99% of log messages are not errors</p>
</li>
</ul>
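<p>Expressed as the “good event” side of a custom KQL SLI, those two examples might look like the sketch below. The field names are assumptions: an APM-style duration field in microseconds and the ECS log.level field.</p>
<pre><code class="language-bash"># 99% of requests occur under 200ms
# (transaction.duration.us is in microseconds, so 200ms = 200000)
transaction.duration.us &lt; 200000

# 99% of log messages are not errors
not log.level : &quot;error&quot;
</code></pre>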
<h2>Application SLOs: OpenTelemetry demo cartservice</h2>
<p>A common application developers and SREs use to learn about OpenTelemetry and test out Observability features is the <a href="https://github.com/elastic/opentelemetry-demo">OpenTelemetry demo</a>.</p>
<p>This demo has <a href="https://opentelemetry.io/docs/demo/feature-flags/">feature flags</a> to simulate issues. With Elastic’s alerting and SLO capability, you can also determine how well the entire application is performing and how well your customer experience is holding up when these feature flags are used.</p>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Elastic supports OpenTelemetry by taking OTLP directly with no need for an Elastic specific agent</a>. You can send in OpenTelemetry data directly from the application (through OTel libraries) and through the collector.</p>
<p>We’ve brought up the OpenTelemetry demo on a K8S cluster (AWS EKS) and turned on the cartservice feature flag. This inserts errors into the cartservice. We’ve also created two SLOs to monitor the cartservice’s availability and latency.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/11-slo-blog.png" alt="SLOs" /></p>
<p>We can see that the cartservice’s availability is violated. As we drill down, we see that there aren’t as many successful transactions, which is affecting the SLO.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/12-slo-blog.png" alt="cartservice-otel" /></p>
<p>As we drill into the service, we can see in Elastic APM that there is a higher than normal failure rate of about 5.5% for the emptyCart service.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/13-slo-blog.png" alt="apm" /></p>
<p>We can investigate this further in APM, but that is a discussion for another blog. Stay tuned to see how we can use Elastic’s machine learning, AIOps, and AI Assistant to understand the issue.</p>
<h2>Conclusion</h2>
<p>SLOs allow you to set clear, measurable targets for your service performance, based on factors like availability, response times, error rates, and other key metrics. Hopefully with the overview we’ve provided in this blog, you can see that:</p>
<ul>
<li>
<p>SLOs can be based on logs. In Elastic, you can use KQL to find and filter on specific logs and log fields to monitor and trigger SLOs.</p>
</li>
<li>
<p>AI Assistant is a valuable, easy-to-use capability to analyze, troubleshoot, and even potentially resolve SLO issues.</p>
</li>
<li>
<p>APM service-based SLOs are easy to create and manage through integration with Elastic APM. OTel telemetry can also be used to monitor SLOs.</p>
</li>
</ul>
<p>For more information on SLOs in Elastic, check out <a href="https://www.elastic.co/guide/en/observability/current/slo.html">Elastic documentation</a> and the following resources:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/guide/en/observability/8.12/slo.html">What’s new in Elastic Observability 8.12</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Introducing the Elastic AI Assistant</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Elastic OpenTelemetry support</a></p>
</li>
</ul>
<p>Ready to get started? Sign up for <a href="https://cloud.elastic.co/registration">Elastic Cloud</a> and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your SLOs.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/139686_-_Elastic_-_Headers_-_V1_3.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Enhancing SRE troubleshooting with the AI Assistant for Observability and your organization's runbooks]]></title>
            <link>https://www.elastic.co/observability-labs/blog/sre-troubleshooting-ai-assistant-observability-runbooks</link>
            <guid isPermaLink="false">sre-troubleshooting-ai-assistant-observability-runbooks</guid>
            <pubDate>Wed, 08 Nov 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Empower your SRE team with this guide to enriching Elastic's AI Assistant Knowledge Base with your organization's internal observability information for enhanced alert remediation and incident management.]]></description>
            <content:encoded><![CDATA[<p>The <a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Observability AI Assistant</a> helps users explore and analyze observability data using a natural language interface, by leveraging automatic function calling to request, analyze, and visualize your data to transform it into actionable observability. The Assistant can also set up a Knowledge Base, powered by <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html">Elastic Learned Sparse EncodeR</a> (ELSER) to provide additional context and recommendations from private data, alongside the large language models (LLMs) using RAG (Retrieval Augmented Generation). Elastic’s Stack — as a vector database with out-of-the-box semantic search and connectors to LLM integrations and the Observability solution — is the perfect toolkit to extract the maximum value of combining your company's unique observability knowledge with generative AI.</p>
<h2>Enhanced troubleshooting for SREs</h2>
<p>Site reliability engineers (SRE) in large organizations often face challenges in locating necessary information for troubleshooting alerts, monitoring systems, or deriving insights due to scattered and potentially outdated resources. This issue is particularly significant for less experienced SREs who may require assistance even with the presence of a runbook. Recurring incidents pose another problem, as the on-call individual may lack knowledge about previous resolutions and subsequent steps. Mature SRE teams often invest considerable time in system improvements to minimize &quot;fire-fighting,&quot; utilizing extensive automation and documentation to support on-call personnel.</p>
<p>Elastic® addresses these challenges by combining generative AI models with relevant search results from your internal data using RAG. The <a href="https://www.elastic.co/guide/en/observability/current/obs-ai-assistant.html">Observability AI Assistant's internal Knowledge Base</a>, powered by our semantic search retrieval model <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html">ELSER</a>, can recall information at any point during a conversation, providing RAG responses based on internal knowledge.</p>
<p>This Knowledge Base can be enriched with your organization's information, such as runbooks, GitHub issues, internal documentation, and Slack messages, allowing the AI Assistant to provide specific assistance. The Assistant can also document and store specific information from an ongoing conversation with an SRE while troubleshooting issues, effectively creating runbooks for future reference. Furthermore, the Assistant can generate summaries of incidents, system status, runbooks, post-mortems, or public announcements.</p>
<p>This ability to retrieve, summarize, and present contextually relevant information is a game-changer for SRE teams, transforming the work from chasing documents and data to an intuitive, contextually sensitive user experience. The Knowledge Base (see <a href="https://www.elastic.co/guide/en/observability/current/obs-ai-assistant.html#obs-ai-requirements">requirements</a>) serves as a central repository of Observability knowledge, breaking documentation silos and integrating tribal knowledge, making this information accessible to SREs enhanced with the power of LLMs.</p>
<p>Your LLM provider may collect query telemetry when using the AI Assistant. If your data is confidential or has sensitive details, we recommend you verify the data treatment policy of the LLM connector you provided to the AI Assistant.</p>
<p>In this blog post, we will cover different ways to enrich your Knowledge Base (KB) with internal information. We will focus on a specific alert, indicating that there was an increase in logs with “502 Bad Gateway” errors that has surpassed the alert’s threshold.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-1.png" alt="1 - threshold breached" /></p>
<h2>How to troubleshoot an alert with the Knowledge Base</h2>
<p>Before the KB has been enriched with internal information, when the SRE asks the AI Assistant about how to troubleshoot an alert, the response from the LLM will be based on the data it learned during training; however, the LLM is not able to answer questions related to private, recent, or emerging knowledge. In this case, when asking for the steps to troubleshoot the alert, the response will be based on generic information.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-2.png" alt="2 - troubleshooting steps" /></p>
<p>However, once the KB has been enriched with your runbooks, when your team receives a new alert on “502 Bad Gateway” Errors, they can use AI Assistant to access the internal knowledge to troubleshoot it, using semantic search to find the appropriate runbook in the Knowledge Base.</p>
<p>In this blog, we will cover different ways to add internal information on how to troubleshoot an alert to the Knowledge Base:</p>
<ol>
<li>
<p>Ask the assistant to remember the content of an existing runbook.</p>
</li>
<li>
<p>Ask the Assistant to summarize and store in the Knowledge Base the steps taken during a conversation and store it as a runbook.</p>
</li>
<li>
<p>Import your runbooks from GitHub or another external source to the Knowledge Base using our Connector and APIs.</p>
</li>
</ol>
<p>After the runbooks have been added to the KB, the AI Assistant is now able to recall the internal and specific information in the runbooks. By leveraging the retrieved information, the LLM could provide more accurate and relevant recommendations for troubleshooting the alert. This could include suggesting potential causes for the alert, steps to resolve the issue, preventative measures for future incidents, or asking the assistant to help execute the steps mentioned in the runbook using functions. With more accurate and relevant information at hand, the SRE could potentially resolve the alert more quickly, reducing downtime and improving service reliability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/Screenshot_2023-11-10_at_9.52.38_AM.png" alt="3 - troubleshooting 502 Bad gateway" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-4.png" alt="4 - (5) test the backend directly" /></p>
<p>Your Knowledge Base documents will be stored in the indices <em>.kibana-observability-ai-assistant-kb-</em>*. Keep in mind that LLMs have a restriction on the amount of information the model can read and write at once, called the token limit. Imagine you're reading a book, but you can only remember a certain number of words at a time. Once you've reached that limit, you start to forget the earlier words you've read. That's similar to how a token limit works in an LLM.</p>
<p>To keep runbooks within the token limit for Retrieval Augmented Generation (RAG) models, ensure the information is concise and relevant. Use bullet points for clarity, avoid repetition, and use links for additional information. Regularly review and update the runbooks to remove outdated or irrelevant information. The goal is to provide clear, concise, and effective troubleshooting information without compromising the quality due to token limit constraints. LLMs are great for summarization, so you could ask the AI Assistant to help you make the runbooks more concise.</p>
<h2>Ask the assistant to remember the content of an existing runbook</h2>
<p>The easiest way to store a runbook into the Knowledge Base is to just ask the AI Assistant to do it! Open a new conversation and ask “Can you store this runbook in the KB for future reference?” followed by pasting the content of the runbook in plain text.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-5.png" alt="5 - new conversation - let's work on this together" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-6.png" alt="6 - new converastion" /></p>
<p>The AI Assistant will then store it in the Knowledge Base for you automatically, as simple as that.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-7.png" alt="7 - storing a runbook" /></p>
<h2>Ask the Assistant to summarize and store the steps taken during a conversation in the Knowledge Base</h2>
<p>You can also ask the AI Assistant to remember something while having a conversation. For example, after you have troubleshot an alert using the AI Assistant, you could ask it to &quot;remember how to troubleshoot this alert for next time.&quot; The AI Assistant will create a summary of the steps taken to troubleshoot the alert and add it to the Knowledge Base, effectively creating runbooks for future reference. Next time you are faced with a similar situation, the AI Assistant will recall this information and use it to assist you.</p>
<p>In the following demo, the user asks the Assistant to remember the steps that have been followed to troubleshoot the root cause of an alert, and also to ping the Slack channel when this happens again. In a later conversation with the Assistant, the user asks what can be done about a similar problem, and the AI Assistant is able to remember the steps and also reminds the user to ping the Slack channel.</p>
<p>After receiving the alert, open the AI Assistant chat and troubleshoot it. Once you have investigated, ask the AI Assistant to summarize the analysis and the steps taken to reach the root cause, so they can be remembered the next time a similar alert fires. You can also add extra instructions, such as warning the Slack channel.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-8.png" alt="8. -teal box" /></p>
<p>The Assistant will use the built-in functions to summarize the steps and store them into your Knowledge Base, so they can be recalled in future conversations.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/Screenshot_2023-11-08_at_11.34.08_AM.png" alt="9 - Elastic assistant chat (CROP)" /></p>
<p>Open a new conversation, and ask what steps to take when troubleshooting an alert similar to the one we just investigated. The Assistant will recall the information stored in the KB that is related to the specific alert, using semantic search based on <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html">ELSER</a>, and provide a summary of the steps taken to troubleshoot it, including the final instruction to inform the Slack channel.</p>
&lt;Video vidyardUuid=&quot;p14Ss8soJDkW8YoCtKPrQF&quot; loop={true} /&gt;
<h2>Import your runbooks stored in GitHub to the Knowledge Base using APIs or our GitHub Connector</h2>
<p>You can also add proprietary data into the Knowledge Base programmatically by ingesting it (e.g., GitHub Issues, Markdown files, Jira tickets, text files) into Elastic.</p>
<p>If your organization has created runbooks that are stored in Markdown documents in GitHub, follow the steps in the next section of this blog post to index the runbook documents into your Knowledge Base.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-10.png" alt="10 - github handling 502" /></p>
<p>The steps to ingest documents into the Knowledge Base are the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-11.png" alt="11 - using internal knowledge" /></p>
<h3>Ingest your organization’s knowledge into Elasticsearch</h3>
<p><strong>Option 1:</strong> <strong>Use the</strong> <a href="https://www.elastic.co/guide/en/enterprise-search/current/crawler.html"><strong>Elastic web crawler</strong></a> <strong>.</strong> Use the web crawler to programmatically discover, extract, and index searchable content from websites and knowledge bases. When you ingest data with the web crawler, a search-optimized <a href="https://www.elastic.co/blog/what-is-an-elasticsearch-index">Elasticsearch® index</a> is created to hold and sync webpage content.</p>
<p><strong>Option 2: Use Elasticsearch's</strong> <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html"><strong>Index API</strong></a> <strong>.</strong> <a href="https://www.elastic.co/guide/en/cloud/current/ec-ingest-guides.html">Watch tutorials</a> that demonstrate how you can use the Elasticsearch language clients to ingest data from an application.</p>
<p><strong>Option 3: Build your own connector.</strong> Follow the steps described in this blog: <a href="https://www.elastic.co/search-labs/how-to-create-customized-connectors-for-elasticsearch">How to create customized connectors for Elasticsearch</a>.</p>
<p><strong>Option 4: Use Elasticsearch</strong> <a href="https://www.elastic.co/guide/en/workplace-search/current/workplace-search-content-sources.html"><strong>Workplace Search connectors</strong></a> <strong>.</strong> For example, the <a href="https://www.elastic.co/guide/en/workplace-search/current/workplace-search-github-connector.html">GitHub connector</a> can automatically capture, sync, and index issues, Markdown files, pull requests, and repos.</p>
<ul>
<li>Follow the steps to <a href="https://www.elastic.co/guide/en/workplace-search/current/workplace-search-github-connector.html#github-configuration">configure the GitHub Connector in GitHub</a> to create an OAuth App from the GitHub platform.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-12.png" alt="12 - elastic workplace search" /></p>
<ul>
<li>Now you can connect a GitHub instance to your organization. Head to your organization’s <strong>Search &gt; Workplace Search</strong> administrative dashboard, and locate the Sources tab.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/Screenshot_2023-11-08_at_10.19.19_AM.png" alt="13 - screenshot" /></p>
<ul>
<li>Select <strong>GitHub</strong> (or GitHub Enterprise) in the Configured Sources list, and follow the GitHub authentication flow as presented. Upon the successful authentication flow, you will be redirected to Workplace Search and will be prompted to select the Organization you would like to synchronize.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-14.png" alt="14 - configure and connect" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-15.png" alt="15 - how to add github" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-16.png" alt="16 - github" /></p>
<ul>
<li>After configuring the connector and selecting the organization, the content should be synchronized and you will be able to see it in Sources. If you don’t need to index all the available content, you can specify the indexing rules via the API. This will help shorten indexing times and limit the size of the index. See <a href="https://www.elastic.co/guide/en/workplace-search/current/workplace-search-customizing-indexing-rules.html">Customizing indexing</a>.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-17.png" alt="17 - source overview" /></p>
<ul>
<li>The source has created an index in Elastic with the content (Issues, Markdown Files…) from your organization. You can find the index name by navigating to <strong>Stack Management &gt; Index Management</strong> , activating the <strong>Include hidden Indices</strong> button on the right, and searching for “GitHub.”</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-18.png" alt="18 - index mgmt" /></p>
<ul>
<li>You can explore the documents you have indexed by creating a Data View and exploring it in Discover. Go to <strong>Stack Management &gt; Kibana &gt; Data Views &gt; Create data view</strong> and introduce the data view Name, Index pattern (make sure you activate “Allow hidden and system indices” in advanced options), and Timestamp field:</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-19.png" alt="19 - create data view" /></p>
<ul>
<li>You can now explore the documents in Discover using the data view:</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-20.png" alt="20 - data view" /></p>
<h3>Reindex your internal runbooks into the AI Assistant’s Knowledge Base index, using its semantic search pipeline</h3>
<p>Your Knowledge Base documents are stored in the indices <em>.kibana-observability-ai-assistant-kb-*</em>. To add your internal runbooks imported from GitHub to the KB, you just need to reindex the documents from the index you created in the previous step to the KB’s index. To add the semantic search capabilities to the documents in the KB, the reindex should also use the ELSER pipeline preconfigured for the KB, <em>.kibana-observability-ai-assistant-kb-ingest-pipeline</em>.</p>
<p>By creating a Data View with the KB index, you can explore the content in Discover.</p>
<p>Execute the query below in <strong>Management &gt; Dev Tools</strong>, making sure to replace the following placeholders in both “_source” and “inline”:</p>
<ul>
<li>InternalDocsIndex : name of the index where your internal docs are stored</li>
<li>text_field : name of the field with the text of your internal docs</li>
<li>timestamp : name of the field of the timestamp in your internal docs</li>
<li>public : (true or false) if true, makes a document available to all users in the defined <a href="https://www.elastic.co/guide/en/kibana/current/xpack-spaces.html">Kibana Space</a> (if one is defined) or in all spaces (if none is defined); if false, the document will be restricted to the user indicated in user.name</li>
<li>(optional) space : if defined, restricts the internal document to be available in a specific <a href="https://www.elastic.co/guide/en/kibana/current/xpack-spaces.html">Kibana Space</a></li>
<li>(optional) user.name : if defined, restricts the internal document to be available for a specific user</li>
<li>(optional) &quot;query&quot; filter to index only certain docs (see below)</li>
</ul>
<pre><code class="language-bash">POST _reindex
{
    &quot;source&quot;: {
        &quot;index&quot;: &quot;&lt;InternalDocsIndex&gt;&quot;,
        &quot;_source&quot;: [
            &quot;&lt;text_field&gt;&quot;,
            &quot;&lt;timestamp&gt;&quot;,
            &quot;namespace&quot;,
            &quot;is_correction&quot;,
            &quot;public&quot;,
            &quot;confidence&quot;
        ]
    },
    &quot;dest&quot;: {
        &quot;index&quot;: &quot;.kibana-observability-ai-assistant-kb-000001&quot;,
        &quot;pipeline&quot;: &quot;.kibana-observability-ai-assistant-kb-ingest-pipeline&quot;
    },
    &quot;script&quot;: {
        &quot;inline&quot;: &quot;ctx._source.text=ctx._source.remove(\&quot;&lt;text_field&gt;\&quot;);ctx._source.namespace=\&quot;&lt;space&gt;\&quot;;ctx._source.is_correction=false;ctx._source.public=&lt;public&gt;;ctx._source.confidence=\&quot;high\&quot;;ctx._source['@timestamp']=ctx._source.remove(\&quot;&lt;timestamp&gt;\&quot;);ctx._source['user.name'] = \&quot;&lt;user.name&gt;\&quot;&quot;
    }
}
</code></pre>
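<p>Once the reindex completes, you can sanity-check that the runbooks landed in the Knowledge Base index with a quick search in Dev Tools. This is an illustrative query; the “text” field name matches the script above:</p>
<pre><code class="language-bash">GET .kibana-observability-ai-assistant-kb-*/_search
{
  &quot;query&quot;: {
    &quot;match&quot;: { &quot;text&quot;: &quot;502 Bad Gateway&quot; }
  },
  &quot;_source&quot;: [ &quot;text&quot;, &quot;public&quot;, &quot;@timestamp&quot; ]
}
</code></pre>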
<p>You may want to restrict the type of documents that you reindex into the KB. For example, to reindex only Markdown documents (like runbooks), you can add a “query” filter to the source. In the case of GitHub, runbooks are identified by the “type” field containing the string “file,” so you could add that to the reindex query as indicated below. To also include GitHub Issues, add the string “issues” to the “type” terms list:</p>
<pre><code class="language-json">&quot;source&quot;: {
        &quot;index&quot;: &quot;&lt;InternalDocsIndex&gt;&quot;,
        &quot;_source&quot;: [
            &quot;&lt;text_field&gt;&quot;,
            &quot;&lt;timestamp&gt;&quot;,
            &quot;namespace&quot;,
            &quot;is_correction&quot;,
            &quot;public&quot;,
            &quot;confidence&quot;
        ],
    &quot;query&quot;: {
      &quot;terms&quot;: {
        &quot;type&quot;: [&quot;file&quot;]
      }
    }
}
</code></pre>
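<p>For instance, to reindex both Markdown files and GitHub Issues, the terms filter would list both values:</p>
<pre><code class="language-json">    &quot;query&quot;: {
      &quot;terms&quot;: {
        &quot;type&quot;: [&quot;file&quot;, &quot;issues&quot;]
      }
    }
</code></pre>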
<p>Great! Now that the data is stored in your Knowledge Base, you can ask the Observability AI Assistant any questions about it:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-21.png" alt="21 - new conversation" /></p>
&lt;Video vidyardUuid=&quot;zRxsp1EYjmR4FW4yRtSxcr&quot; loop={true} /&gt;
&lt;Video vidyardUuid=&quot;vV5md3mVtY8KxUVjSvtT7V&quot; loop={true} /&gt;
<h2>Conclusion</h2>
<p>In conclusion, leveraging internal Observability knowledge and adding it to the Elastic Knowledge Base can greatly enhance the capabilities of the AI Assistant. By manually inputting information or programmatically ingesting documents, SREs can create a central repository of knowledge accessible through the power of Elastic and LLMs. The AI Assistant can recall this information, assist with incidents, and provide tailored observability to specific contexts using Retrieval Augmented Generation. By following the steps outlined in this article, organizations can unlock the full potential of their Elastic AI Assistant.</p>
<p><a href="https://www.elastic.co/generative-ai/ai-assistant">Start enriching your Knowledge Base with the Elastic AI Assistant today</a> and empower your SRE team with the tools they need to excel. Follow the steps outlined in this article and take your incident management and alert remediation processes to the next level. Your journey toward a more efficient and effective SRE operation begins now.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/11-hand.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[How to Troubleshoot Kubernetes Pod Restarts & OOMKilled Events with Agent Builder]]></title>
            <link>https://www.elastic.co/observability-labs/blog/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder</link>
            <guid isPermaLink="false">troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder</guid>
            <pubDate>Wed, 25 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to immediately troubleshoot Kubernetes pod restarts and OOMKilled events with Elastic Agent Builder. We’ll show how to detect, analyze, and remediate failures.]]></description>
            <content:encoded><![CDATA[<h2>Initial Summary</h2>
<ul>
<li>Detect Kubernetes pod restarts and OOMKill events using Elastic Agent Builder</li>
<li>Analyze CPU and memory pressure using ES|QL over Kubernetes metrics</li>
<li>Generate troubleshooting summaries and remediation guidance</li>
</ul>
<p>This article explains how to use <a href="https://www.elastic.co/search-labs/blog/elastic-ai-agent-builder-context-engineering-introduction">Elastic Agent Builder</a> to automatically detect, analyze, and remediate Kubernetes pod failures caused by resource pressure (CPU and memory), with a focus on pods experiencing frequent restarts and OOMKilled events. Elastic Agent Builder lets you quickly create precise agents that utilize all your data with powerful tools (such as ES|QL queries), chat interfaces, and custom agents.</p>
<h2>Introduction: What is the Elastic Agent Builder?</h2>
<p>Elastic includes an embedded AI agent that you can use to get more insight from all of the logs, metrics, and traces you’ve ingested. That’s useful on its own, but you can take it one step further and streamline the process by creating tools for the agent to use.</p>
<p>Giving the agent tools means it spends less time ‘thinking’ and quickly gets to assessing what’s important to you. For example, if I have a Kubernetes environment that needs monitoring, and I want to keep an eye on pod restarts and memory and CPU usage without hanging out at the terminal, I can have Elastic alert me if something goes wrong. </p>
<p>Having an alert is great, but how do I get the bigger picture, faster? You need to know what service is having (or creating) the issues, why, and how to fix it.</p>
<h2>Assumptions</h2>
<p>This guide assumes:</p>
<ul>
<li>A running Kubernetes cluster</li>
<li>An Elastic Observability deployment</li>
<li>Kubernetes metrics indexed in Elastic</li>
</ul>
<h2>Step 1: Create a New Elastic Agent</h2>
<p>In Elastic Observability, use the top search bar to search for Agents. Create a new agent.</p>
<p>This agent is going to be the Kubernetes Pod Troubleshooter agent, designed to help users troubleshoot pod restarts and OOMKilled terminations, and to evaluate CPU or memory pressure.</p>
<p>The Kubernetes Pod Troubleshooter agent will:</p>
<ol>
<li>Identify pods that have restarted more than once</li>
<li>Filter for pods that are not in a running state</li>
<li>Retrieve the container termination reason (e.g., OOMKilled)</li>
<li>Analyze CPU and memory utilization for affected services</li>
<li>Flag resource utilization above 60% (warning) and 80% (critical)</li>
<li>Provide remediation recommendations</li>
</ol>
<p>The agent requires instructions that guide how it behaves when interacting with tools or responding to queries. These instructions can set tone, priorities, or special behaviors. The instructions below tell the agent to execute the steps outlined above.</p>
<pre><code>You will help users troubleshoot problematic pods by searching the metrics for pods that have restarted more than once and the status is not running. Pods that have the highest number of restarts will be returned to the user.
Once the containers that are not running and have restarted multiple times are found you will use their container ID or image name to look up the container status reason and reason for the last termination. You will return that reason to the user.
You will also begin basic troubleshooting steps, such as checking for insufficient cluster resources (CPU or memory) from the metrics and tools available.
Any CPU or memory utilization percentages over 60%, and definitely over 80% should be flagged to the user with remediation steps.
</code></pre>
<p>Getting answers quickly is critical when troubleshooting high-value systems and environments. Using Tools ensures that the workflow is repeatable and that you can trust the results. You also get complete oversight of the process, as the Elastic Agent outlines every step and query that it took and you can explore the results in Discover.</p>
<p>You will create custom tools that the agent runs to complete the Kubernetes troubleshooting tasks the custom instructions reference, such as: <code>look up the container status reason and reason for the last termination</code> and <code>checking for insufficient cluster resources (CPU or memory)</code>.</p>
<h2>Step 2: Create Tools - Pod Restarts</h2>
<p>The first tool takes the Kubernetes metrics and assesses whether a pod has restarted and has a last-terminated reason; if it does, the agent presents that information to the user.</p>
<p>This <code>pod-restarts</code> tool uses a custom ES|QL query that interrogates the Kubernetes metrics data coming from OTel.</p>
<p>The ES|QL query:</p>
<ol>
<li>Filters for containers that have restarted and have a reason for termination; then</li>
<li>Calculates the number of restarts; then</li>
<li>Returns the number of restarts and termination reason per service.</li>
</ol>
<pre><code>FROM metrics-k8sclusterreceiver.otel-default
| WHERE metrics.k8s.container.restarts &gt; 0
| WHERE resource.attributes.k8s.container.status.last_terminated_reason IS NOT NULL
| STATS total_restarts = SUM(metrics.k8s.container.restarts),
        reasons = VALUES(resource.attributes.k8s.container.status.last_terminated_reason) 
  BY resource.attributes.service.name
| SORT total_restarts DESC
</code></pre>
<h2>Step 3: Create Tools - Service Memory</h2>
<p>The custom tools can take input variables, which increases the speed and accuracy of the results.</p>
<p>A common reason for pods failing to schedule, or restarting often, is that the cluster or nodes are under-resourced. The <code>pod-restarts</code> tool returns services that have many restarts and OOMKilled termination reasons, which indicate memory pressure.</p>
<p>The <code>eval-pod-memory</code> tool is a custom ES|QL query that:</p>
<ol>
<li>Filters for metrics data that match the service name returned from the <code>pod-restarts</code> tool within the last 12 hours; then</li>
<li>Converts memory usage, requests, limits and utilization into megabytes; then</li>
<li>Calculates the average of each of those metrics; then</li>
<li>Groups them into one-minute buckets and sorts them.</li>
</ol>
<pre><code>FROM metrics-*
| WHERE resource.attributes.service.name == ?servicename
| WHERE @timestamp &gt;= NOW() - 12 hours
| EVAL
   memory_usage_mb = metrics.container.memory.usage / 1024 / 1024,
   memory_request_mb = metrics.k8s.container.memory_request / 1024 / 1024,
   memory_limit_mb = metrics.k8s.container.memory_limit / 1024 / 1024,
   memory_utilization_pct = metrics.k8s.container.memory_limit_utilization * 100
| STATS
   avg_memory_usage = AVG(memory_usage_mb),
   avg_memory_request = AVG(memory_request_mb),
   avg_memory_limit = AVG(memory_limit_mb),
   avg_memory_utilization = AVG(memory_utilization_pct)
   BY bucket = BUCKET(@timestamp, 1 minute)
| SORT bucket ASC
</code></pre>
<h2>Step 4: Create Tools - Service CPU</h2>
<p>As CPU usage is another common reason for pods to fail scheduling or be stuck in endless restart loops, the next tool will evaluate CPU usage, requests and limits.</p>
<p>The <code>eval-pod-cpu</code> tool is a custom ES|QL query that:</p>
<ol>
<li>Filters for metrics data that match the service name returned from the <code>pod-restarts</code> tool within the last 12 hours; then</li>
<li>Calculates the average for CPU usage, CPU request utilization and CPU limit utilization.</li>
</ol>
<pre><code>FROM metrics-kubeletstatsreceiver.otel-default
| WHERE k8s.container.name == ?servicename OR resource.attributes.k8s.container.name == ?servicename
| STATS
  avg_cpu_usage = AVG(container.cpu.usage),
  avg_cpu_request_utilization = AVG(k8s.container.cpu_request_utilization) * 100,
  avg_cpu_limit_utilization = AVG(k8s.container.cpu_limit_utilization) * 100
| LIMIT 100
</code></pre>
<h2>Step 5: Assign Tools to Kubernetes Pod Troubleshooter Agent</h2>
<p>Once all of the tools are built you need to assign them to the agent.</p>
<p>This image shows the Kubernetes Pod Troubleshooter agent with the three tools: <code>pod-restarts</code>, <code>eval-pod-cpu</code> and <code>eval-pod-memory</code> assigned to it and active.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/kubernetes-pod-troubleshooter.png" alt="kubernetes-pod-troubleshooter" /></p>
<h2>Step 6: Test the Kubernetes Pod Troubleshooter Agent</h2>
<p>To simulate memory pressure, the OpenTelemetry demo is running inside the cluster. Artificially lowering the memory requests and limits while increasing the service load will cause pods to restart.</p>
<p>To do this with the OpenTelemetry demo in your cluster, follow these steps.</p>
<p>Reduce the cart service to one replica by scaling the deployment. Once that is complete, change the resources on the deployment by lowering the memory requests and limits as shown in this command:</p>
<pre><code>kubectl -n otel-demo scale deploy/cart --replicas=1
kubectl -n otel-demo set resources deploy/cart -c cart --requests=memory=50Mi --limits=memory=60Mi
</code></pre>
<p>The OpenTelemetry demo application comes with a load generator, which simulates requests to the demo site. Increase the users and spawn rate in the load-generator deployment, as shown in this command:</p>
<pre><code>kubectl -n otel-demo set env deploy/load-generator LOCUST_USERS=800 LOCUST_SPAWN_RATE=200 LOCUST_BROWSER_TRAFFIC_ENABLED=false
</code></pre>
<p>If you list all of your pods in the cluster or namespace, you should begin to see restarts.</p>
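<p>One quick way to spot them is to filter the RESTARTS column of the default <code>kubectl get pods</code> output (a sketch; the <code>filter_restarts</code> helper name is illustrative and assumes the default column layout):</p>

```shell
# Keep only pods whose RESTARTS column (4th in default `kubectl get pods` output)
# is greater than zero, printing the pod name and restart count.
filter_restarts() { awk '$4+0 > 0 {print $1, $4}'; }

# Against a live cluster:
# kubectl -n otel-demo get pods --no-headers | filter_restarts
```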
<p>You can now chat with the Kubernetes Pod Troubleshooter agent and ask “Are any of my Kubernetes pods having issues?”.</p>
<p>The screenshot shows the final response from the Kubernetes Pod Troubleshooter agent. It provides a problem summary of its findings from each tool, showing which services were experiencing the most restarts and memory and CPU utilization. </p>
<p>The threshold interpretations were described in the initial agent instructions, where &gt;60% utilization is a warning (sustained pressure) and &gt;80% utilization is critical (high likelihood of restarts or throttling). This aligns with findings presented by the Kubernetes Pod Troubleshooter agent, where the services that had the highest restarts were all above 90% memory utilization. The agent needs clearly defined threshold values to correctly assess the returned memory and CPU utilization values. </p>
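<p>The agent’s threshold bands can be sketched as a small shell function (the 60% and 80% cutoffs come from the agent instructions above; the function name is illustrative):</p>

```shell
# Map a whole-number utilization percentage to the agent's severity bands.
classify_utilization() {
  pct=$1
  if [ "$pct" -ge 80 ]; then
    echo critical    # high likelihood of restarts or throttling
  elif [ "$pct" -ge 60 ]; then
    echo warning     # sustained resource pressure
  else
    echo ok
  fi
}

classify_utilization 92    # critical
```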
<p>Problem summary returned by the Kubernetes Pod Troubleshooter agent:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/problem-summary-by-Kubernetes.png" alt="problem summary by Kubernetes" /></p>
<h2>Conclusion and Final Thoughts</h2>
<p>Elastic Agent Builder enables fast, repeatable Kubernetes troubleshooting by combining ES|QL-driven analysis with constrained AI reasoning.</p>
<p>Custom tools built on specific ES|QL queries, combined with downstream queries that take input variables from the output of previous tools, eliminate or reduce error propagation and hallucinations. By comparison, generic AI troubleshooting without purpose-built tools risks analyzing too many services that aren’t relevant to the issue at hand, which slows the thinking process and generates longer responses, increasing the likelihood of error propagation and hallucinations.</p>
<p>With the Elastic Agent Builder, you can inspect the output of every tool to explore and verify the results.</p>
<p>Having a succinct problem summary is a game-changer, bringing your attention straight to the most affected services.</p>
<p>Reasoning returned by the Kubernetes Pod Troubleshooter agent:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/return-pod-troubleshooter-agent.png" alt="summary-returned-kubernetes-pod-troubleshooter" /></p>
<p>Not only that, but the agent can go one step further and offer recommendations for remediation based on what outputs the tools delivered.</p>
<p>Remediation recommendation returned by the Kubernetes Pod Troubleshooter agent:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/remediation-recommendation-kubernetes-pod-troubleshooter.png" alt="remediation-recommendation-kubernetes-pod-troubleshooter" /></p>
<p>Sign up for <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> and try this out with your Kubernetes clusters.</p>
<h2>Frequently Asked Questions</h2>
<p><strong>1. When to use the Elastic Agent Builder for Troubleshooting</strong></p>
<p>Elastic Agent Builder works best for troubleshooting when:</p>
<ul>
<li>
<p>You need repeatable, auditable troubleshooting workflows</p>
</li>
<li>
<p>You want deterministic analysis instead of free-form AI responses</p>
</li>
<li>
<p>You’re investigating something that is reported in the logs or metrics (i.e. pod restarts, OOMKills, or resource pressure)</p>
</li>
<li>
<p>You want to reduce mean time to resolution (MTTR)</p>
</li>
</ul>
<p><strong>2. Do I need OpenTelemetry to use Elastic Agent Builder for Kubernetes troubleshooting?</strong> </p>
<p>No, you don’t need to use OpenTelemetry. You have two options:</p>
<ul>
<li>
<p>You can collect logs and metrics from Kubernetes using the Elastic Agent; or </p>
</li>
<li>
<p>You can collect logs, traces and metrics with the Elastic Distro for OTel (EDOT) Collector</p>
</li>
</ul>
<p>Your choice changes the field names used in the tools above: for example, <code>kubernetes.container.memory.usage.bytes</code> (Elastic Agent) versus <code>metrics.container.memory.usage</code> (EDOT).</p>
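<p>For example, the memory query from Step 3 rewritten for Elastic Agent data might look like the sketch below. The index pattern and the <code>kubernetes.container.name</code> field are assumptions based on the Kubernetes integration; only <code>kubernetes.container.memory.usage.bytes</code> is taken from the example above, so verify the field names against your own data streams:</p>

```esql
FROM metrics-kubernetes.container-*
| WHERE kubernetes.container.name == ?servicename
| EVAL memory_usage_mb = kubernetes.container.memory.usage.bytes / 1024 / 1024
| STATS avg_memory_usage = AVG(memory_usage_mb) BY bucket = BUCKET(@timestamp, 1 minute)
| SORT bucket ASC
```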
<p><strong>3. Can this agent be adapted for node-level failures?</strong> </p>
<p>Yes, Elastic has hundreds of <a href="https://www.elastic.co/docs/reference/fleet#integrations">integrations</a>, including AWS (for EKS), Azure (for AKS), Google Cloud (for GKE), as well as host operating system monitoring.</p>
<p>The queries shown above would be modified to use the corresponding fields.</p>
<p><strong>4. Can these tools be reused in automation workflows?</strong> </p>
<p>Yes, <a href="https://www.elastic.co/search-labs/blog/elastic-workflows-automation">Elastic Workflows</a> can reuse the same scripted automations and AI agents you build in Elastic. An agent can handle the initial analysis and investigation (reducing manual effort), and the workflow can continue with structured steps, such as running Elasticsearch queries, transforming data, branching on conditions and calling external APIs or tools like Slack, Jira and PagerDuty. Workflows can also be exposed to Agent Builder as reusable tools, just like the tool created in this guide.</p>
<p>For more advanced automation from a similar scenario as described in this guide, learn how to <a href="https://www.elastic.co/observability-labs/blog/agentic-cicd-kubernetes-mcp-server">integrate AI agents into GitHub Actions to monitor K8s health and improve deployment reliability via Observability</a>.</p>
<p><strong>5. Can these tools be triggered by alerts?</strong> </p>
<p>Yes, alerts can trigger <a href="https://www.elastic.co/search-labs/blog/elastic-workflows-automation">Elastic Workflows</a>, and pass the alert context to the workflow. This workflow may be integrated with an Elastic Agent, as described above.</p>
<p>Additionally, Elastic Alerts allow you to publish investigation guides alongside alerts so an SRE has all of the information they need to begin investigating. Any troubleshooting or investigative agents can be linked to from the investigation guide, meaning the SRE doesn’t have to follow manual processes outlined in an investigation guide and instead let the agent handle the manual, repetitive investigations.</p>
<p><strong>6. How can I get started with Agent Builder?</strong></p>
<p>Sign up for <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a>, a new fully managed, stateless architecture that auto-scales no matter your data, usage, and performance needs.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/cover.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Using a custom agent with the OpenTelemetry Operator for Kubernetes]]></title>
            <link>https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-elastic-agents</link>
            <guid isPermaLink="false">using-the-otel-operator-for-injecting-elastic-agents</guid>
            <pubDate>Tue, 16 Jul 2024 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[<p>This is the second part of a two part series. The first part is available at <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">Zero config OpenTelemetry auto-instrumentation for Kubernetes Java applications</a>. In that first part I walk through setting up and installing the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a>, and configuring that for auto-instrumentation of a Java application using the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a>.</p>
<p>In this second part, I show how to install <em>any</em> Java agent via the OpenTelemetry operator, using the Elastic Java agents as examples.</p>
<h2>Installation and configuration recap</h2>
<p>Part 1 of this series, <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">Zero config OpenTelemetry auto-instrumentation for Kubernetes Java applications</a>, details the installation and configuration of the OpenTelemetry operator and an Instrumentation resource. Here is an outline of the steps as a reminder:</p>
<ol>
<li>Install cert-manager, eg <code>kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml</code></li>
<li>Install the operator, eg <code>kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml</code></li>
<li>Create an Instrumentation resource</li>
<li>Add an annotation to either the deployment or the namespace</li>
<li>Deploy the application as normal</li>
</ol>
<p>In that first part, steps 3, 4 &amp; 5 were implemented for the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a>. In this blog I’ll implement them for other agents, using the Elastic APM agents as examples. I assume that steps 1 &amp; 2 outlined above have already been done, ie that the operator is now installed. I will continue using the <code>banana</code> namespace for the examples, so ensure that namespace exists (<code>kubectl create namespace banana</code>). As per part 1, if you use any of the example instrumentation definitions below, you’ll need to substitute <code>my.apm.server.url</code> and <code>my-apm-secret-token</code> with the values appropriate for your collector.</p>
<h2>Using the Elastic Distribution for OpenTelemetry Java</h2>
<p>From version 0.4.0, the <a href="https://github.com/elastic/elastic-otel-java">Elastic Distribution for OpenTelemetry Java</a> includes the agent jar at the path <code>/javaagent.jar</code> in the docker image - which is essentially all that is needed for a docker image to be usable by the OpenTelemetry operator for auto-instrumentation. This means the Instrumentation resource is straightforward to define, and as it’s a distribution of the OpenTelemetry Java agent, all the OpenTelemetry environment can apply:</p>
<pre><code>apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: elastic-otel
  namespace: banana
spec:
  exporter:
    endpoint: https://my.apm.server.url
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: &quot;1.0&quot;
  java:
    image: docker.elastic.co/observability/elastic-otel-javaagent:1.9.0
    env:
      - name: OTEL_EXPORTER_OTLP_HEADERS
        value: &quot;Authorization=Bearer my-apm-secret-token&quot;
      - name: ELASTIC_OTEL_INFERRED_SPANS_ENABLED
        value: &quot;true&quot;
      - name: ELASTIC_OTEL_SPAN_STACK_TRACE_MIN_DURATION
        value: &quot;50&quot;
</code></pre>
<p>I’ve included environment variables that switch on several features in the agent, including:</p>
<ol>
<li>ELASTIC_OTEL_INFERRED_SPANS_ENABLED to switch on the inferred spans feature described in <a href="https://www.elastic.co/observability-labs/blog/tracing-data-inferred-spans-opentelemetry">this blog</a></li>
<li>ELASTIC_OTEL_SPAN_STACK_TRACE_MIN_DURATION, which causes span stack traces to be captured automatically for any span that takes longer than this duration (the default is 5ms)</li>
</ol>
<p>Adding in the annotation ...</p>
<pre><code>metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: &quot;elastic-otel&quot;
</code></pre>
<p>... to the pod yaml gets the application traced, and displayed in the Elastic APM UI, including the inferred child spans and stack traces</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-elastic-agents/elastic-apm-ui-with-stack-trace.png" alt="Elastic APM UI showing methodB traced with stack traces and inferred spans" /></p>
<p>The additions from the features mentioned above are circled in red - inferred spans (for methodC and methodD) bottom left, and the stack trace top right. (Note that the pod included the <code>OTEL_INSTRUMENTATION_METHODS_INCLUDE</code> environment variable set to <code>&quot;test.Testing[methodB]&quot;</code> so that traces from methodB are shown; for pod configuration see the &quot;Trying it&quot; section in <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">part 1</a>)</p>
<h2>Using the Elastic APM Java agent</h2>
<p>From version 1.50.0, the <a href="https://github.com/elastic/apm-agent-java">Elastic APM Java agent</a> includes the agent jar at the path <code>/javaagent.jar</code> in the docker image - which is essentially all that is needed for a docker image to be usable by the OpenTelemetry operator for auto-instrumentation. This means the Instrumentation resource is straightforward to define:</p>
<pre><code>apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: elastic-apm
  namespace: banana
spec:
  java:
    image: docker.elastic.co/observability/apm-agent-java:1.55.4
    env:
      - name: ELASTIC_APM_SERVER_URL
        value: &quot;https://my.apm.server.url&quot;
      - name: ELASTIC_APM_SECRET_TOKEN
        value: &quot;my-apm-secret-token&quot;
      - name: ELASTIC_APM_LOG_LEVEL
        value: &quot;INFO&quot;
      - name: ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED
        value: &quot;true&quot;
      - name: ELASTIC_APM_LOG_SENDING
        value: &quot;true&quot;
</code></pre>
<p>I’ve included environment variables that switch on several features in the agent, including:</p>
<ul>
<li>ELASTIC_APM_LOG_LEVEL set to the default value (INFO) which could easily be switched to DEBUG</li>
<li>ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED to switch on the inferred spans implementation equivalent to the feature described in <a href="https://www.elastic.co/observability-labs/blog/tracing-data-inferred-spans-opentelemetry">this blog</a></li>
<li>ELASTIC_APM_LOG_SENDING which switches on sending logs to the APM UI, the logs are automatically correlated with transactions (for all common logging frameworks)</li>
</ul>
<p>Adding in the annotation ...</p>
<pre><code>metadata:
  annotations:
     instrumentation.opentelemetry.io/inject-java: &quot;elastic-apm&quot;
</code></pre>
<p>... to the pod yaml gets the application traced, and displayed in the Elastic APM UI, including the inferred child spans</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-elastic-agents/elastic-apm-ui-with-inferred-spans.png" alt="Elastic APM UI showing methodB traced with inferred spans" /></p>
<p>(Note that the pod included the <code>ELASTIC_APM_TRACE_METHODS</code> environment variable set to <code>&quot;test.Testing#methodB&quot;</code> so that traces from methodB are shown; for pod configuration see the &quot;Trying it&quot; section in <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">part 1</a>)</p>
<h2>Using an extension with the OpenTelemetry Java agent</h2>
<p>Setting up an Instrumentation resource for the OpenTelemetry Java agent is straightforward and was done in <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">part 1</a> of this two part series; as the examples above show, it’s just a matter of deciding on the docker image URL you want to use. If you want to include an <em>extension</em> in your deployment, however, that is a little more complex, though still supported by the operator. The extensions you want to include with the agent need to be in docker images, or you have to build an image that includes any extensions not already in images. You then declare the images, and the directories the extensions live in, in the Instrumentation resource. As an example, I’ll show an Instrumentation which uses version 2.5.0 of the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a> together with the <a href="https://github.com/elastic/elastic-otel-java/tree/main/inferred-spans">inferred spans extension</a> from the <a href="https://github.com/elastic/elastic-otel-java">Elastic OpenTelemetry Java distribution</a>. The distro image includes the extension at the path <code>/extensions/elastic-otel-agentextension.jar</code>. The Instrumentation resource allows either directories or file paths to be specified; here I’ll list the directory:</p>
<pre><code>apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: otel-plus-extension-instrumentation
  namespace: banana
spec:
  exporter:
    endpoint: https://my.apm.server.url
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: &quot;1.0&quot;
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.5.0
    extensions:
      - image: &quot;docker.elastic.co/observability/elastic-otel-javaagent:1.9.0&quot;
        dir: &quot;/extensions&quot;
    env:
      - name: OTEL_EXPORTER_OTLP_HEADERS
        value: &quot;Authorization=Bearer my-apm-secret-token&quot;
      - name: ELASTIC_OTEL_INFERRED_SPANS_ENABLED
        value: &quot;true&quot;
</code></pre>
<p>Note that you can have multiple <code>image … dir</code> pairs, ie include multiple extensions from different images. Note also if you are testing this specific configuration that the inferred spans extension included here will be contributed to the OpenTelemetry contrib repo at some point after this blog is published, after which the extension may no longer be present in a later version of the referred image (since it will be available from the <a href="https://github.com/open-telemetry/opentelemetry-java-contrib/">contrib repo</a> instead).</p>
<h2>Next steps</h2>
<p>Here I’ve shown how to use any agent with the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a>, and configure that for your system. In particular the examples have showcased how to use the Elastic Java agents to auto-instrument Java applications running in your Kubernetes clusters, along with how to enable features, using Instrumentation resources. And you can set it up for either zero config for deployments, or for just one annotation which is generally a more flexible mechanism (you can have multiple Instrumentation resource definitions, and the deployment can select the appropriate one for its application).</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-elastic-agents/blog-header-720x420.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Zero config OpenTelemetry auto-instrumentation for Kubernetes Java applications]]></title>
            <link>https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents</link>
            <guid isPermaLink="false">using-the-otel-operator-for-injecting-java-agents</guid>
            <pubDate>Thu, 11 Jul 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Walking through how to install and enable the OpenTelemetry Operator for Kubernetes to auto-instrument Java applications, with no configuration changes needed for deployments]]></description>
            <content:encoded><![CDATA[<p>The <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a> has a number of <a href="https://opentelemetry.io/docs/languages/java/automatic/#setup">ways to install</a> the agent into a Java application. If you are running your Java applications in Kubernetes pods, there is a separate mechanism (which under the hood uses JAVA_TOOL_OPTIONS and other environment variables) to auto-instrument Java applications. This auto-instrumentation can be achieved with zero configuration of the applications and pods!</p>
<p>The mechanism to achieve zero-config auto-instrumentation of Java applications in Kubernetes is via the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a>. This operator has many capabilities and the full documentation (and of course source) is available in the project itself. In this blog, I'll walk through installing, setting up and running zero-config auto-instrumentation of Java applications in Kubernetes using the OpenTelemetry Operator.</p>
<h2>Installing the OpenTelemetry Operator</h2>
<p>At the time of writing this blog, the OpenTelemetry Operator requires cert-manager to be installed, after which the operator itself can be installed. Installing from the web is straightforward. First install <code>cert-manager</code> (the version to install is specified in the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a> documentation):</p>
<pre><code>kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml
</code></pre>
<p>Then, when the cert-manager pods are ready (<code>kubectl get pods -n cert-manager</code>) ...</p>
<pre><code>NAMESPACE      NAME                                         READY
cert-manager   cert-manager-67c98b89c8-rnr5s                1/1
cert-manager   cert-manager-cainjector-5c5695d979-q9hxz     1/1
cert-manager   cert-manager-webhook-7f9f8648b9-8gxgs        1/1
</code></pre>
<p>... you can install the OpenTelemetry Operator:</p>
<pre><code>kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
</code></pre>
<p>You can, of course, pin a specific version of the operator instead of using <code>latest</code>; here I’ve used <code>latest</code>.</p>
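<p>Once the manifest has been applied, you can check that the operator pod is up and running. By default the manifest creates an <code>opentelemetry-operator-system</code> namespace (adjust the namespace if you installed the operator elsewhere):</p>
<pre><code>kubectl get pods -n opentelemetry-operator-system
</code></pre>
<p>The operator pod should report <code>Running</code> with <code>2/2</code> containers ready before you move on to creating Instrumentation resources.</p>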
<h2>An Instrumentation resource<a id="an-instrumentation-resource"></a></h2>
<p>Now you need to add just one further Kubernetes resource to enable auto-instrumentation: an <code>Instrumentation</code> resource. I am going to use the <code>banana</code> namespace for my examples, so I have first created that namespace (<code>kubectl create namespace banana</code>). The auto-instrumentation is specified and configured by these Instrumentation resources. Here is a basic one which will allow every Java pod in the <code>banana</code> namespace to be auto-instrumented with version 2.5.0 of the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a>:</p>
<pre><code>apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: banana-instr
  namespace: banana
spec:
  exporter:
    endpoint: &quot;https://my.endpoint&quot;
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: &quot;1.0&quot;
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.5.0
    env:
      - name: OTEL_EXPORTER_OTLP_HEADERS
        value: &quot;Authorization=Bearer MyAuth&quot;
</code></pre>
<p>Creating this resource (e.g. with <code>kubectl apply -f banana-instr.yaml</code>, assuming the above YAML was saved in a file called <code>banana-instr.yaml</code>) makes the <code>banana-instr</code> Instrumentation resource available for use. (Note you will need to change <code>my.endpoint</code> and <code>MyAuth</code> to values appropriate for your collector.) You can use this instrumentation immediately by adding an annotation to any deployment in the <code>banana</code> namespace:</p>
<pre><code>metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: &quot;true&quot;
</code></pre>
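<p>For a Deployment, the annotation needs to end up on the <em>pods</em> the Deployment creates, so it belongs in the pod template's metadata rather than the Deployment's own metadata. Here is a hypothetical sketch (the image name is just a placeholder for your own Java application):</p>
<pre><code>apiVersion: apps/v1
kind: Deployment
metadata:
  name: banana-deployment
  namespace: banana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: banana-app
  template:
    metadata:
      labels:
        app: banana-app
      annotations:
        instrumentation.opentelemetry.io/inject-java: &quot;true&quot;
    spec:
      containers:
        - name: banana-app
          image: my-registry/my-java-app:latest
</code></pre>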
<p>The <code>banana-instr</code> Instrumentation resource is not yet applied by <em>default</em> to all pods in the <code>banana</code> namespace. It is zero-config as far as the <em>application</em> is concerned, but it still requires an annotation on each <em>pod or deployment</em>. To make it fully zero-config for <em>all pods</em> in the <code>banana</code> namespace, we need to add that annotation to the namespace itself, i.e. edit the namespace (<code>kubectl edit namespace banana</code>) so that it has contents similar to:</p>
<pre><code>apiVersion: v1
kind: Namespace
metadata:
  name: banana
  annotations:
    instrumentation.opentelemetry.io/inject-java: &quot;banana-instr&quot;
...
</code></pre>
<p>Now we have a namespace that is going to auto-instrument <em>every</em> Java application deployed in the <code>banana</code> namespace with the 2.5.0 <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a>!</p>
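<p>Alternatively, instead of editing the namespace interactively, you can add the same annotation with a single command:</p>
<pre><code>kubectl annotate namespace banana instrumentation.opentelemetry.io/inject-java=banana-instr
</code></pre>
<p>Should you want to disable the default injection again later, the annotation can be removed with <code>kubectl annotate namespace banana instrumentation.opentelemetry.io/inject-java-</code> (note the trailing dash, which is kubectl's syntax for removing an annotation).</p>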
<h2>Trying it<a id="trying-it"></a></h2>
<p>There is a simple example Java application at <a href="http://docker.elastic.co/demos/apm/k8s-webhook-test">docker.elastic.co/demos/apm/k8s-webhook-test</a> which just repeatedly calls the chain <code>main-&gt;methodA-&gt;methodB-&gt;methodC-&gt;methodD</code> with some sleeps in the calls. Running this (<code>kubectl apply -f banana-app.yaml</code>) using a very basic pod definition:</p>
<pre><code>apiVersion: v1
kind: Pod
metadata:
  name: banana-app
  namespace: banana
  labels:
    app: banana-app
spec:
  containers:
    - image: docker.elastic.co/demos/apm/k8s-webhook-test
      imagePullPolicy: Always
      name: banana-app
      env: 
      - name: OTEL_INSTRUMENTATION_METHODS_INCLUDE
        value: &quot;test.Testing[methodB]&quot;
</code></pre>
<p>results in the app being auto-instrumented with no configuration changes! The resulting app shows up in any APM UI, such as Elastic APM:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-java-agents/elastic-apm-ui-transaction.png" alt="Elastic APM UI showing methodB traced" /></p>
<p>As you can see, for this example I also added the env var <code>OTEL_INSTRUMENTATION_METHODS_INCLUDE=&quot;test.Testing[methodB]&quot;</code> to the pod YAML so that traces from <code>methodB</code> would appear.</p>
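<p>If you want to confirm that the webhook really did mutate the pod, you can inspect the running pod directly; the injected agent path should be visible in both the init container command and the <code>JAVA_TOOL_OPTIONS</code> environment variable:</p>
<pre><code>kubectl get pod banana-app -n banana -o yaml | grep -i javaagent
</code></pre>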
<h2>The technology behind the auto-instrumentation<a id="the-technology-behind-the-auto-instrumentation"></a></h2>
<p>You don’t need to understand the underlying mechanism to use the auto-instrumentation, but for those of you who are interested, here’s a quick outline.</p>
<ol>
<li>The <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a> installs a <a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/">mutating webhook</a>, a standard Kubernetes component.</li>
<li>When deploying, Kubernetes first sends all definitions to the mutating webhook.</li>
<li>If the mutating webhook sees that the conditions for auto-instrumentation are met (i.e.
<ol>
<li>there is an Instrumentation resource for that namespace and</li>
<li>the correct annotation for that Instrumentation is applied to the definition in some way, either from the definition itself or from the namespace),</li>
</ol>
</li>
<li>then the mutating webhook “mutates” the definition to include the environment defined by the Instrumentation resource.</li>
<li>The environment includes the explicit values defined in the env, as well as some implicit OpenTelemetry values (see the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a> documentation for full details).</li>
<li>And most importantly, the operator
<ol>
<li>pulls the image defined in the Instrumentation resource,</li>
<li>extracts the file at the path <code>/javaagent.jar</code> from that image (using shell command <code>cp</code>)</li>
<li>inserts it into the pod at path <code>/otel-auto-instrumentation-java/javaagent.jar</code></li>
<li>and adds the environment variable <code>JAVA_TOOL_OPTIONS=-javaagent:/otel-auto-instrumentation-java/javaagent.jar</code>.</li>
</ol>
</li>
<li>The JVM automatically picks up that JAVA_TOOL_OPTIONS environment variable on startup and applies it to the JVM command-line.</li>
</ol>
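<p>Putting those steps together, the mutated pod spec ends up looking roughly like the following sketch (abbreviated; the exact init container and volume names are operator implementation details and may differ between operator versions):</p>
<pre><code>spec:
  initContainers:
    - name: opentelemetry-auto-instrumentation-java
      image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.5.0
      command: [&quot;cp&quot;, &quot;/javaagent.jar&quot;, &quot;/otel-auto-instrumentation-java/javaagent.jar&quot;]
      volumeMounts:
        - name: opentelemetry-auto-instrumentation-java
          mountPath: /otel-auto-instrumentation-java
  containers:
    - name: banana-app
      env:
        - name: JAVA_TOOL_OPTIONS
          value: -javaagent:/otel-auto-instrumentation-java/javaagent.jar
      volumeMounts:
        - name: opentelemetry-auto-instrumentation-java
          mountPath: /otel-auto-instrumentation-java
  volumes:
    - name: opentelemetry-auto-instrumentation-java
      emptyDir: {}
</code></pre>
<p>The shared <code>emptyDir</code> volume is what lets the agent jar copied by the init container be visible to the application container at JVM startup.</p>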
<h2>Next steps<a id="next-steps"></a></h2>
<p>This walkthrough can be repeated in any Kubernetes cluster to demonstrate and experiment with auto-instrumentation (you will need to create the <code>banana</code> namespace first). In part 2 of this two-part series, <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-elastic-agents">Using a custom agent with the OpenTelemetry Operator for Kubernetes</a>, I show how to install any Java agent via the OpenTelemetry Operator, using the Elastic Java agents as examples.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-java-agents/blog-header.png" length="0" type="image/png"/>
        </item>
    </channel>
</rss>