Elastic Observability Labs - Articles by Jonas Kunz

Beyond the trace: Pinpointing performance culprits with continuous profiling and distributed tracing correlation

Thu, 28 Mar 2024 00:00:00 GMT

Observability goes beyond monitoring; it's about truly understanding your system. To achieve this comprehensive view, practitioners need a unified observability solution that natively combines insights from metrics, logs, traces, and crucially, continuous profiling. While metrics, logs, and traces offer valuable insights, they can't answer the all-important "why." Continuous profiling signals act as a magnifying glass, providing granular code visibility into the system's hidden complexities. They fill the gap left by other data sources, enabling you to answer critical questions: Why is this trace slow? Where exactly in the code is the bottleneck?

Traces provide the "what" and "where" — what happened and where in your system. Continuous profiling refines this understanding by pinpointing the "why" and validating your hypotheses about the "what." Just like a full-body MRI scan, Elastic's whole-system continuous profiling (powered by eBPF) uncovers unknown-unknowns in your system. This includes not just your code, but also third-party libraries and kernel activity triggered by your application transactions. This comprehensive visibility improves your mean-time-to-detection (MTTD) and mean-time-to-recovery (MTTR) KPIs.

Bridging the disconnect between continuous profiling and OTel traces

Historically, continuous profiling signals have been largely disconnected from OpenTelemetry (OTel) traces. Here's the exciting news: we're bridging this gap! We're introducing native correlation between continuous profiling signals and OTel traces, starting with Java.

Imagine this: You're troubleshooting a performance issue and identify a slow trace. Whole-system continuous profiling steps in, acting like an MRI scan for your entire codebase and system. It narrows down the culprit to the specific lines of code hogging CPU time within the context of your distributed trace. This empowers you to answer the "why" question confidently and with minimal effort, all within the same troubleshooting context.

Furthermore, by correlating continuous profiling with distributed tracing, Elastic Observability customers can measure the cloud cost and CO₂ impact of every code change at the service and transaction level.

This milestone is significant, especially considering the recent developments in the OTel community. With OTel adopting profiling and Elastic donating the industry’s most advanced eBPF-based continuous profiling agent to OTel, we're set for a game-changer in observability — empowering OTel end users with a correlated system visibility that goes from a trace span in the userspace down to the kernel.

However, achieving this goal, especially with Java, presented significant challenges and demanded serious engineering R&D. This blog post will delve into these challenges, explore the approaches we considered in our proof-of-concepts, and explain how we arrived at a solution that can be easily extended to other OTel language agents. Most importantly, this solution correlates traces with profiling signals at the agent, not in the backend, to ensure optimal query performance and minimal reliance on vendor backend storage architectures.

Figuring out the active OTel trace and span

The primary technical challenge in this endeavor is essentially the following: whenever the profiler interrupts an OTel instrumented process to capture a stacktrace, we need to be able to efficiently determine the active span and trace ID (per-thread) and the service name (per-process).

For the purpose of this blog, we'll focus on the recently released Elastic distribution of the OTel Java instrumentation, but the approach that we ended up with generalizes to any language that can load and call into a native library. So, how do we get our hands on those IDs?

The OTel Java agent itself keeps track of the active span by storing a stack of spans in the OpenTelemetryContext, which itself is stored in a ThreadLocal variable. We originally considered reading these Java structures directly from BPF, but we eventually decided against that approach. There is no documented specification on how ThreadLocals are implemented, and reliably reading and following the JVM's internal data-structures would incur a high maintenance burden. Any minor update to the JVM could change details of the structure layouts. To add to this, we would also have to reverse engineer how each JVM version lays out Java class fields in memory, as well as how all the high-level Java types used in the context objects are actually implemented under the hood. This approach further wouldn't generalize to any non-JVM language and needs to be repeated for any language that we wish to support.

After we had convinced ourselves that reading Java ThreadLocal directly is not the answer, we decided to look for more portable alternatives instead. The option that we ultimately settled on is to load and call into a C++ library that is responsible for making the required information available via a known and defined interface whenever the span changes.

Unlike Java's ThreadLocals, the details on how a native shared library should expose per-process and per-thread data are well-defined in the System V ABI specification and the architecture-specific ELF ABI documents.

Exposing per-process information

Exposing per-process data is easy: we simply declare a global variable . . .

void* elastic_tracecorr_process_storage_v1 = nullptr;

. . . and expose it via ELF symbols. When the user initializes the OTel library to set the service name, we allocate a buffer and populate it with data in a protocol that we defined for this purpose. Once the buffer is fully populated, we update the global pointer to point to the buffer.
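The populate-then-publish order described above can be sketched as follows. Only the symbol name comes from the article; the buffer layout (a 2-byte little-endian length followed by the service name) and the helper names `publish_service_name` and `read_service_name` are hypothetical stand-ins, since the actual protocol is not described here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Symbol name from the article; exported with C linkage so it is easy to find.
extern "C" {
void* elastic_tracecorr_process_storage_v1 = nullptr;
}

// Hypothetical layout: [len_lo][len_hi][service name bytes...].
void publish_service_name(const std::string& service_name) {
  const std::size_t capped = std::min<std::size_t>(service_name.size(), 0xffff);
  const uint16_t len = static_cast<uint16_t>(capped);

  // Intentionally never freed: an external reader may look at it at any time.
  uint8_t* buf = new uint8_t[2 + len];
  buf[0] = static_cast<uint8_t>(len & 0xff);
  buf[1] = static_cast<uint8_t>(len >> 8);
  std::memcpy(buf + 2, service_name.data(), len);

  // Publish the pointer only after the buffer is fully populated.
  __atomic_store_n(&elastic_tracecorr_process_storage_v1,
                   static_cast<void*>(buf), __ATOMIC_RELEASE);
}

// Reader-side counterpart, decoding the hypothetical layout above.
std::string read_service_name(const void* storage) {
  if (storage == nullptr) return std::string();  // nothing published yet
  const uint8_t* buf = static_cast<const uint8_t*>(storage);
  const uint16_t len = static_cast<uint16_t>(buf[0] | (buf[1] << 8));
  return std::string(reinterpret_cast<const char*>(buf + 2), len);
}
```

A reader that sees a null pointer simply treats the process as "not initialized yet" and tries again later.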

On the profiling agent side, we already have code in place that detects libraries and executables loaded into any process's address space. We normally use this mechanism to detect and analyze high-level language interpreters (e.g., libpython, libjvm) when they are loaded, but it also turned out to be a perfect fit to detect the OTel trace correlation library. When the library is detected in a process, we scan the exports, resolve the symbol, and read the per-process information directly from the instrumented process’ memory.

Exposing per-thread information

With the easy part out of the way, let's get to the nitty-gritty portion: exposing per-thread information via thread-local storage (TLS). So, what exactly is TLS, and how does it work? At the most basic level, the idea is to have one instance of a variable for every thread. Semantically, you can think of it as a global map from thread ID to value, although that is not how it is implemented.

On Linux, there are two major options for thread locals: TSD and TLS.

Thread-specific data (TSD)

TSD is the older and probably more commonly known variant. It works by explicitly allocating a key via pthread_key_create — usually during process startup — and passing it to all threads that require access to the thread-local variable. The threads can then pass that key to the pthread_getspecific and pthread_setspecific functions to read and update the variable for the currently running thread.

TSD is simple, but for our purposes it has a range of drawbacks:

The pthread_key_t structure is opaque and doesn't have a defined layout. Similar to the Java ThreadLocals, the underlying data-structures aren't defined by the ABI documents and different libc implementations (glibc, musl) will handle them differently.
We cannot call a function like pthread_getspecific from BPF, so we'd have to reverse engineer and reimplement the logic. Logic may change between libc versions, and we’d have to detect the version and support all variants that may come up in the wild.
TSD performance is not predictable and varies depending on how many thread local variables have been allocated in the process previously. This may not be a huge concern for Java specifically since spans are typically not swapped super rapidly, but it’d likely be quite noticeable for user-mode scheduling languages where the context might need to be swapped at every await point/coroutine yield.
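For reference, this is roughly what TSD looks like from the application side. It is a minimal single-threaded sketch using the standard pthread calls named above; the wrapper names (`init_span_key`, `set_current_span`, `get_current_span`) are hypothetical.

```cpp
#include <pthread.h>

// The key is shared by all threads; the value stored under it is per-thread.
static pthread_key_t g_span_key;

// Allocate the key once, typically during process startup.
bool init_span_key() {
  return pthread_key_create(&g_span_key, nullptr) == 0;
}

// Update the value for the calling thread only.
bool set_current_span(void* span) {
  return pthread_setspecific(g_span_key, span) == 0;
}

// Read the value for the calling thread; nullptr if it was never set.
void* get_current_span() {
  return pthread_getspecific(g_span_key);
}
```

Every read goes through a libc function call and an opaque key, which is exactly what makes this variant hard to follow from BPF.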

None of this is strictly prohibitive, but a lot of this is annoying at the very least. Let’s see if we can do better!

Thread-local storage (TLS)

Starting with C11 and C++11, both languages support thread local variables directly via the _Thread_local and thread_local storage specifiers, respectively. Declaring a variable as per-thread is now a matter of simply adding the keyword:

thread_local void* elastic_tracecorr_tls_v1 = nullptr;
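To illustrate the "one instance per thread" semantics, here is a small sketch. The variable `demo_tls_slot` and the helper `set_on_other_thread` are hypothetical and only stand in for the real variable shown above.

```cpp
#include <thread>

// Hypothetical stand-in for the article's thread-local variable.
thread_local void* demo_tls_slot = nullptr;

// Writes `value` into the slot on a freshly started thread and returns
// what that thread observed in its own slot before writing.
void* set_on_other_thread(void* value) {
  void* seen_before = value;
  std::thread worker([&seen_before, value] {
    seen_before = demo_tls_slot;  // a new thread starts from the initializer
    demo_tls_slot = value;        // changes only the worker's instance
  });
  worker.join();
  return seen_before;
}
```

The calling thread's copy of `demo_tls_slot` is untouched by the write that happens on the worker thread.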

You might assume that the compiler simply inserts calls to the corresponding pthread functions when variables declared this way are accessed, but this is not actually the case. The reality is surprisingly complicated, and it turns out that there are four different models of TLS that the compiler can choose to generate. For some of those models, there are further multiple dialects that can be used to implement them. The different models and dialects come with various portability versus performance trade-offs. If you are interested in the details, I suggest reading this blog article that does a great job at explaining them.

The TLS model and dialect are usually chosen by the compiler based on a somewhat opaque and complicated set of architecture-specific rules. Fortunately for us, both gcc and clang allow users to pick a particular one using the -ftls-model and -mtls-dialect arguments. The variant that we ended up picking for our purposes is -ftls-model=global-dynamic and -mtls-dialect=gnu2 (and desc on aarch64).

Let's take a look at the assembly that is being generated when accessing a thread_local variable under these settings. Our function:

void setThreadProfilingCorrelationBuffer(JNIEnv* jniEnv, jobject bytebuffer) {
  if (bytebuffer == nullptr) {
    elastic_tracecorr_tls_v1 = nullptr;
  } else {
    elastic_tracecorr_tls_v1 = jniEnv->GetDirectBufferAddress(bytebuffer);
  }
}

Is compiled to the following assembly code:

Both possible branches assign a value to our thread-local variable. Let’s focus on the right branch corresponding to the nullptr case to get rid of the noise from the GetDirectBufferAddress function call:

lea   rax, elastic_tracecorr_tls_v1_tlsdesc  ;; Load some pointer into rax.
call  qword ptr [rax]                        ;; Read & call function pointer at rax.
mov   qword ptr fs:[rax], 0                  ;; Assign 0 to the pointer returned by
                                             ;; the function that we just called.

The fs: portion of the mov instruction is the actual magic bit that makes the memory read per-thread. We’ll get to that later; let’s first look at the mysterious elastic_tracecorr_tls_v1_tlsdesc variable that the compiler emitted here. It’s an instance of the tlsdesc structure that is located somewhere in the .got.plt ELF section. The structure looks like this:

struct tlsdesc {
  // Function pointer used to retrieve the offset
  uint64_t (*resolver)(tlsdesc*);

  // TLS offset -- more on that later.
  uint64_t tp_offset;
};

The resolver field is initialized with nullptr and tp_offset with a per-executable offset. The first thread-local variable in an executable will usually have offset 0, the next one sizeof(first_var), and so on. At first glance this may appear to be similar to how TSD works, with the call to pthread_getspecific to resolve the actual offset, but there is a crucial difference. When the library is loaded, the resolver field is filled in with the address of __tls_get_addr by the loader (ld.so). __tls_get_addr is a relatively heavy function that allocates a TLS offset that is globally unique between all shared libraries in the process. It then proceeds by updating the tlsdesc structure itself, inserting the global offset and replacing the resolver function with a trivial one:

uint64_t second_stage_resolver(tlsdesc* desc) {
  return desc->tp_offset;
}

In essence, this means that the first access to a tlsdesc-based thread-local variable is rather expensive, but all subsequent ones are cheap. We further know that by the time that our C++ library starts publishing per-thread data, it must have gone through the initial resolving process already. Consequently, all that we need to do is to read the final offset from the process's memory and cache it. We also refresh the offset every now and then to ensure that we really have the final offset, combating the unlikely but possible race condition that we read the offset before it was initialized. We can detect this case by comparing the resolver address against the address of the __tls_get_addr function exported by ld.so.
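The self-patching behavior can be modeled in a few lines. This is a simulation of the simplified description above, not glibc's actual resolver code: `slow_resolver`, `fast_resolver`, `kModuleTlsBase`, and `read_final_offset` are hypothetical names, and the "allocation" is reduced to adding a fixed base.

```cpp
#include <cstdint>

struct TlsDesc {
  uint64_t (*resolver)(TlsDesc*);  // replaced after the first call
  uint64_t tp_offset;              // module-relative first, process-wide later
};

// Hypothetical: where this module's TLS block ends up in the process.
static const uint64_t kModuleTlsBase = 0x40;

// Cheap second-stage resolver: just hand back the cached offset.
uint64_t fast_resolver(TlsDesc* desc) { return desc->tp_offset; }

// Expensive first-stage resolver: compute the final offset, then patch the
// descriptor so that later calls take the cheap path.
uint64_t slow_resolver(TlsDesc* desc) {
  desc->tp_offset += kModuleTlsBase;
  desc->resolver = fast_resolver;
  return desc->tp_offset;
}

// What an outside observer does: only trust the offset once the descriptor
// no longer points at the first-stage resolver.
bool read_final_offset(const TlsDesc& desc, uint64_t* out) {
  if (desc.resolver == nullptr || desc.resolver == slow_resolver) return false;
  *out = desc.tp_offset;
  return true;
}
```

An observer that gets `false` back simply retries on its next refresh, which corresponds to the periodic re-read described above.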

Determining the TLS offset from an external process

With that out of the way, the next question that arises is how to actually find the tlsdesc in memory so that we can read the offset. Intuitively one might expect that the dynamic symbol exported on the ELF file points to that descriptor, but that is not actually the case.

$ readelf --wide --dyn-syms elastic-jvmti-linux-x64.so | grep elastic_tracecorr_tls_v1
328: 0000000000000000     8 TLS     GLOBAL DEFAULT   19 elastic_tracecorr_tls_v1

The dynamic symbol instead contains an offset relative to the start of the .tls ELF section and points to the initial value that libc initializes the TLS value with when it is allocated. So how does ld.so find the tlsdesc to fill in the initial resolver? In addition to the dynamic symbol, the compiler also emits a relocation record for our symbol, and that one actually points to the descriptor structure that we are looking for.

$ readelf --relocs --wide elastic-jvmti-linux-x64.so | grep R_X86_64_TLSDESC
00000000000426e8  0000014800000024 R_X86_64_TLSDESC       0000000000000000 elastic_tracecorr_tls_v1 + 0

To read the final TLS offset, we thus simply have to:

Wait for the event notifying us about a new shared library being loaded into a process
Do some cheap heuristics to detect our C++ library, avoiding the more expensive analysis below from being executed for every unrelated library on the system
Analyze the library on disk and scan ELF relocations for our per-thread variable to extract the tlsdesc address
Rebase that address to match where our library was loaded in that particular process
Read the offset from tlsdesc+8
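The last two steps (rebasing and reading tlsdesc+8) can be sketched against a simulated mapping. `MappedLibrary` and `read_tls_offset` are hypothetical; a real profiler would read the target process's memory instead of a local vector.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical model of one library mapped into a target process.
struct MappedLibrary {
  uint64_t load_bias;           // runtime address of the library's address 0
  std::vector<uint8_t> memory;  // bytes of the mapping, starting at load_bias

  // Copies `n` bytes from absolute address `addr`; false if out of range.
  bool read(uint64_t addr, void* dst, std::size_t n) const {
    if (addr < load_bias) return false;
    const uint64_t pos = addr - load_bias;
    if (pos > memory.size() || n > memory.size() - pos) return false;
    std::memcpy(dst, memory.data() + pos, n);
    return true;
  }
};

// `reloc_offset` is the address from the TLSDESC relocation record.
bool read_tls_offset(const MappedLibrary& lib, uint64_t reloc_offset,
                     uint64_t* tls_offset) {
  const uint64_t desc_addr = lib.load_bias + reloc_offset;  // rebase
  // tp_offset follows the 8-byte resolver pointer inside the descriptor.
  return lib.read(desc_addr + 8, tls_offset, sizeof(*tls_offset));
}
```

With the relocation shown above, `reloc_offset` would be 0x426e8, and the load bias is whatever base address the library received in that particular process.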

Determining the TLS base

Now that we have the offset, how do we use that to actually read the data that the library puts there for us? This brings us back to the magic fs: portion of the mov instruction that we discussed earlier. In X86, most memory operands can optionally be supplied with a segment register that influences the address translation.

Segments are an archaic construct from the early days of 16-bit X86 where they were used to extend the address space. Essentially the architecture provides a range of segment registers that can be configured with different base addresses, thus allowing more than 16-bits worth of memory to be accessed. In times of 64-bit processors, this is hardly a concern anymore. In fact, X86-64 aka AMD64 got rid of all but two of those segment registers: fs and gs.

So why keep two of them? It turns out that they are quite useful for the use-case of thread-local data. Since every thread can be configured to have its own base address in these segment registers, we can use it to point to a block of data for this specific thread. That is precisely what libc implementations on Linux are doing with the fs segment. The offset that we snatched from the process's memory earlier is used as an address with the fs segment register, and the CPU automatically adds it to the per-thread base address.
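The address arithmetic is simple enough to model directly. In this sketch the "address space" is just a byte vector and all names are hypothetical; it also assumes that the offset is stored as an unsigned 64-bit value, so an offset below the base is expressed through wrap-around.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical flat address space: an address is an index into this vector.
using Memory = std::vector<uint8_t>;

// Mirrors what the CPU does for `fs:[offset]`: per-thread base plus offset.
// Unsigned wrap-around lets (uint64_t)-16 mean "16 bytes below the base".
uint64_t tls_address(uint64_t fs_base, uint64_t tp_offset) {
  return fs_base + tp_offset;
}

// Helpers for the simulation; the caller must keep addresses in range.
void store_u64(Memory& mem, uint64_t addr, uint64_t value) {
  std::memcpy(mem.data() + addr, &value, sizeof(value));
}

uint64_t load_u64(const Memory& mem, uint64_t addr) {
  uint64_t value = 0;
  std::memcpy(&value, mem.data() + addr, sizeof(value));
  return value;
}
```

The same offset combined with two different bases yields two different addresses, which is all that "thread-local" means at this level.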

To retrieve the base address pointed to by the fs segment register in the kernel, we need to read its destination from the kernel’s task_struct for the thread that we happened to interrupt with our profiling timer event. Getting the task struct is easy because we are blessed with the bpf_get_current_task BPF helper function. BPF helpers are pretty much syscalls for BPF programs: we can just ask the Linux kernel to hand us the pointer.

Armed with the task pointer, we now have to read the thread.fsbase (X86-64) or thread.uw.tp_value (aarch64) field to get our desired base address that the user-mode process accesses via fs. This is where things get complicated one last time, at least if we wish to support older kernels without BTF support (we do!). The task_struct is huge and there are hundreds of fields that can be present or not depending on how the kernel is configured. Being a core primitive of the scheduler, it is also constantly subject to changes between different kernel versions. On modern Linux distributions, the kernel is typically nice enough to tell us the offset via BTF. On older ones, the situation is more complicated. Since hardcoding the offset is clearly not an option if we hope the code to be portable, we instead have to figure out the offset by ourselves.

We do this by consulting /proc/kallsyms, a file with mappings between kernel functions and their addresses, and then using BPF to dump the compiled code of a kernel function that rarely changes and uses the desired offset. We dynamically disassemble and analyze the function and extract the offset directly from the assembly. For X86-64 specifically, we dump the aout_dump_debugregs function that accesses thread->ptrace_bps, which has consistently been 16 bytes away from the fsbase field that we are interested in for all kernels that we have ever looked at.
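The first step, looking up a function's address in /proc/kallsyms, might look like the sketch below. Each line of that file has the form "address type name", optionally followed by a module name in brackets. The function name `find_kernel_symbol` is hypothetical, and the disassembly step that follows is deliberately left out.

```cpp
#include <cstdint>
#include <cstdlib>
#include <sstream>
#include <string>

// Scans kallsyms-formatted text for `wanted` and stores its address.
// Returns false if the symbol is missing or its address is hidden
// (unprivileged readers see all-zero addresses).
bool find_kernel_symbol(std::istream& kallsyms, const std::string& wanted,
                        uint64_t* address) {
  std::string line;
  while (std::getline(kallsyms, line)) {
    std::istringstream fields(line);
    std::string addr_hex, type, name;
    if (!(fields >> addr_hex >> type >> name)) continue;  // malformed line
    if (name != wanted) continue;
    const uint64_t parsed = std::strtoull(addr_hex.c_str(), nullptr, 16);
    if (parsed == 0) return false;  // address masked by the kernel
    *address = parsed;
    return true;
  }
  return false;
}
```

In practice the stream would be a `std::ifstream` opened on /proc/kallsyms by a sufficiently privileged process.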

Reading TLS data from kernel

With all the required offsets at our hands, we can now finally do what we set out to do in the first place: use them to enrich our stack traces with the OTel trace and span IDs that our C++ library prepared for us!

void maybe_add_otel_info(Trace* trace) {
  // Did user-mode insert a TLS offset for this process? Read it.
  TraceCorrProcInfo* proc = bpf_map_lookup_elem(&tracecorr_procs, &trace->pid);

  // No entry -> process doesn't have the C++ library loaded.
  if (!proc) return;

  // Load the fsbase offset from our global configuration map.
  u32 key = 0;
  SystemConfig* syscfg = bpf_map_lookup_elem(&system_config, &key);

  // Map lookups can fail; the BPF verifier insists on the NULL check.
  if (!syscfg) return;

  // Use the fsbase offset to read fsbase from the kernel's task struct.
  u8* fsbase;
  u8* task = (u8*)bpf_get_current_task();
  if (bpf_probe_read_kernel(&fsbase, sizeof(fsbase), task + syscfg->fsbase_offset))
    return;

  // Use the TLS offset to read the **pointer** to our TLS buffer.
  void* corr_buf_ptr;
  if (bpf_probe_read_user(
        &corr_buf_ptr,
        sizeof(corr_buf_ptr),
        fsbase + proc->tls_offset
      ))
    return;

  // Read the information that our library prepared for us. A failed read
  // (e.g., a null pointer because no span is active) leaves the trace as-is.
  TraceCorrelationBuf corr_buf;
  if (bpf_probe_read_user(&corr_buf, sizeof(corr_buf), corr_buf_ptr))
    return;

  // If the library reports that we are currently in a trace, store it into
  // the stack trace that will be reported to our user-land process.
  if (corr_buf.trace_present && corr_buf.valid) {
    trace->otel_trace_id.as_int.hi = corr_buf.trace_id.as_int.hi;
    trace->otel_trace_id.as_int.lo = corr_buf.trace_id.as_int.lo;
    trace->otel_span_id.as_int = corr_buf.span_id.as_int;
  }
}

Sending out the mappings

From this point on, everything else is pretty simple. The C++ library sets up a Unix datagram socket during startup and communicates the socket path to the profiler via the per-process data block. The stack traces annotated with the OTel trace and span IDs are sent from BPF to our user-mode profiler process via perf event buffers. The profiler process in turn sends the mappings between OTel span and trace IDs and stack trace hashes to the C++ library. Our extensions to the OTel instrumentation framework then read those mappings and insert the stack trace hashes into the OTel trace.

This approach has a few major upsides compared to the perhaps more obvious alternative of sending out the OTel span and trace ID with the profiler’s stacktrace records. We want the stacktrace associations to be stored in the trace indices to allow filtering and aggregating stacktraces by the plethora of fields available on OTel traces. If we were to send out the trace IDs via the profiler's gRPC connection instead, we’d have to search for and update the corresponding OTel trace records in the profiling collector to insert the stack trace hashes.

This is not trivial: stacktraces are sent out rather frequently (every 5 seconds, as of writing) and the corresponding OTel trace might not have been sent and stored by the time the corresponding stack traces arrive in our cluster. We’d have to build a kind of delay queue and periodically retry updating the OTel trace documents, introducing avoidable database work and complexity in the collectors. With the approach of sending stacktrace mappings to the OTel instrumented process instead, the need for server-side merging vanishes entirely.

Trace correlation in action

With all the hard work out of the way, let’s take a look at what trace correlation looks like in action!

Future work: Supporting other languages

We have demonstrated that trace correlation can work nicely for Java, but we have no intention of stopping there. The general approach that we discussed previously should work for any language that can efficiently load and call into our C++ library and doesn’t do user-mode scheduling with coroutines. The problem with user-mode scheduling is that the logical thread can change at any await/yield point, requiring us to update the trace IDs in TLS. Many such coroutine environments like Rust’s Tokio provide the ability to register a callback for whenever the active task is swapped, so they can be supported easily. Other languages, however, do not provide that option.

One prominent example in that category is Go: goroutines are built on user-mode scheduling, but to our knowledge there’s no way to instrument the scheduler. Such languages will need solutions that don’t go via the generic TLS path. For Go specifically, we have already built a prototype that uses pprof labels that are associated with a specific goroutine, having Go’s scheduler update them for us automatically.

Getting started

We hope this blog post has given you an overview of correlating profiling signals to distributed tracing, and its benefits for end-users.

To get started, download the Elastic distribution of the OTel agent, which contains the new trace correlation library. Additionally, you will need the latest version of the Universal Profiling agent, bundled with Elastic Stack version 8.13.

Acknowledgment

We appreciate Trask Stalnaker, maintainer of the OTel Java agent, for his feedback on our approach and for reviewing the early draft of this blog post.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

Announcing GA of Elastic distribution of the OpenTelemetry Java Agent

Thu, 12 Sep 2024 00:00:00 GMT

As Elastic continues its commitment to OpenTelemetry (OTel), we are excited to announce general availability of the Elastic Distribution of OpenTelemetry Java (EDOT Java). EDOT Java is a fully compatible drop-in replacement for the OTel Java agent that comes with a set of built-in, useful extensions for powerful additional features and improved usability with Elastic Observability. Use EDOT Java to start the OpenTelemetry SDK with your Java application, and automatically capture tracing data, performance metrics, and logs. Traces, metrics, and logs can be sent to any OpenTelemetry Protocol (OTLP) collector you choose.

With EDOT Java you have access to all the features of the OpenTelemetry Java agent plus:

Access to SDK improvements and bug fixes contributed by the Elastic team before the changes are available upstream in OpenTelemetry repositories.
Access to optional features that can enhance OpenTelemetry data that is being sent to Elastic (for example, inferred spans and span stacktrace).

In this blog post, we will explore the rationale behind our unique distribution, detailing the powerful additional features it brings to the table. We will provide an overview of how these enhancements can be utilized with our distribution, the standard OTel SDK, or the vanilla OTel Java agent. Stay tuned as we conclude with a look ahead at our future plans and what you can expect from Elastic contributions to OTel Java moving forward.

Elastic Distribution of OpenTelemetry Java (EDOT Java)

Until now, Elastic users looking to monitor their Java services through automatic instrumentation had two options: the proprietary Elastic APM Java agent or the vanilla OTel Java agent. While both agents offer robust capabilities and have reached a high level of maturity, each has its distinct advantages and limitations. The OTel Java agent provides extensive instrumentation across a broad spectrum of frameworks and libraries, is highly extensible, and natively emits OTel data. Conversely, the Elastic APM Java agent includes several powerful features absent in the OTel Java agent.

Elastic’s distribution of the OTel Java agent aims to bring together the best aspects of the proprietary Elastic Java agent and the OpenTelemetry Java agent. This distribution enhances the vanilla OTel Java agent with a set of additional features realized through extensions, while still being a fully compatible drop-in replacement.

Elastic’s commitment to OpenTelemetry not only focuses on standardizing data collection around OTel but also includes improving OTel components and integrating Elastic's data collection features into OTel. In this vein, our ultimate goal is to contribute as many features as possible from Elastic’s distribution back to the upstream OTel Java agent; our distribution is designed in such a way that the additional features, realized as extensions, work directly with the OTel SDK. This means they can be used independently of Elastic’s distro, either with the OTel Java SDK or with the vanilla OTel Java agent. We’ll discuss these usage patterns further in the sections below.

Features included

The Elastic distribution of the OpenTelemetry Java agent includes a suite of extensions that deliver the features outlined below.

Inferred spans

In a recent blog post, we introduced inferred spans, a powerful feature designed to enhance distributed traces with additional profiling-based spans.

Inferred spans (blue spans labeled “internal” in the above image) offer valuable insights into sources of latency within the code that might remain uncaptured by purely instrumentation-based traces. In other words, they fill in the gaps between instrumentation-based traces. The Elastic distribution of the OTel Java agent includes the inferred spans feature. It can be enabled by setting the following environment variable.

ELASTIC_OTEL_INFERRED_SPANS_ENABLED=true

Correlation with profiling

With OpenTelemetry embracing profiling and Elastic's proposal to donate its eBPF-based, continuous profiling agent, a new frontier opens up in correlating distributed traces with continuous profiling data. This integration offers unprecedented code-level insights into latency issues and CO₂ emission footprints, all within a clearly defined service, transaction, and trace context. To get started, follow this guide to set up universal profiling and the OpenTelemetry integration. For more background information on the feature, check out this blog article, where we explore how these technologies converge to enhance observability and environmental consciousness in software development.

Users of Elastic Universal Profiling can already leverage the Elastic distribution of the OTel Java agent to access this powerful integration. With Elastic's proposed donation of the profiling agent, we anticipate that this capability will soon be available to all OTel users who employ the OTel Java agent in conjunction with the new OTel eBPF profiling.

Span stack traces

In many cases, spans within a distributed trace are relatively coarse-grained, particularly when features like inferred spans are not used. Understanding precisely where in the code path a span originates can be incredibly valuable. To address this need, the Elastic distribution of the OTel Java agent includes the span stack traces feature. This functionality provides crucial insights by collecting corresponding stack traces for spans that exceed a configurable minimum duration, pinpointing exactly where a span is initiated in the code.

This simple yet powerful feature significantly enhances problem troubleshooting, offering developers a clearer understanding of their application’s performance dynamics.

In the example above, it allows you to get the call stack of a gRPC call, which can help you understand which code paths triggered it.

Auto-detection of service and cloud resources

In today's expansive and diverse cloud environments, which often include multiple regions and cloud providers, having information on where your services are operating is incredibly valuable. Particularly in Java services, where the service name is frequently embedded within the deployment artifacts, the ability to automatically retrieve service and cloud resource information marks a substantial leap in usability.

To address this need, the Elastic distribution of the OTel Java agent includes built-in auto detectors for service and cloud resources, specifically for AWS and GCP, sourced from the OpenTelemetry Java Contrib repository. This feature, which is on by default, enhances observability and streamlines the management of services across various cloud platforms, making it a key asset for any cloud-based deployment.

Ways to use the EDOT Java

The Elastic distribution of the OTel Java agent is designed to meet our users exactly where they are, accommodating a variety of needs and strategic approaches. Whether you're looking to fully integrate new observability features or simply enhance existing setups, the Elastic distribution offers multiple technical pathways to leverage its capabilities. This flexibility ensures that users can tailor the agent's implementation to align perfectly with their specific operational requirements and goals.

Using Elastic’s distribution directly

The most straightforward path to harnessing the capabilities described above is by adopting the Elastic distribution of the OTel Java agent as a drop-in replacement for the standard OTel Java agent. Structurally, the Elastic distro functions as a wrapper around the OTel Java agent, maintaining full compatibility with all upstream configuration options and incorporating all its features. Additionally, it includes the advanced features described above that significantly augment its functionality. Users of the Elastic distribution will also benefit from the comprehensive technical support provided by Elastic, which will commence once the agent achieves general availability. To get started, simply download the agent JAR file and attach it to your application:

java -javaagent:/pathto/elastic-otel-javaagent.jar -jar myapp.jar

Using Elastic’s extensions with the vanilla OTel Java agent

If you prefer to continue using the vanilla OTel Java agent but wish to take advantage of the features described above, you have the flexibility to do so. We offer a separate agent extensions package specifically designed for this purpose. To integrate these enhancements, simply download and place the extensions jar file into a designated directory and configure the OTel Java agent extensions directory:

export OTEL_JAVAAGENT_EXTENSIONS=/pathto/elastic-otel-agentextension.jar
java -javaagent:/pathto/otel-javaagent.jar -jar myapp.jar

Using Elastic’s extensions manually with the OTel Java SDK

If you build your instrumentations directly into your applications using the OTel API and rely on the OTel Java SDK instead of the automatic Java agent, you can still use the features we've discussed. Each feature is designed as a standalone component that can be integrated with the OTel Java SDK framework. To implement these features, simply refer to the specific descriptions for each one to learn how to configure the OTel Java SDK accordingly:

Setting up the inferred spans feature with the SDK
Setting up profiling correlation with the SDK
Setting up the span stack traces feature with the SDK
Setting up resource detectors with the SDK

This approach ensures that you can tailor your observability tools to meet your specific needs without compromising on functionality.

Future plans and contributions

We are committed to OpenTelemetry, and our contributions to the OpenTelemetry Java project will continue without limit. Not only are we focused on general improvements within the OTel Java project, but we are also committed to ensuring that the features discussed in this blog post become official extensions to the OpenTelemetry Java SDK/Agent and are included in the OpenTelemetry Java Contrib repository. We have already contributed the span stack trace feature and initiated the contribution of the inferred spans feature, and we are eagerly anticipating the opportunity to add the profiling correlation feature following the successful integration of Elastic’s profiling agent.

Moreover, our efforts extend beyond the current enhancements; we are actively working to port more features from the Elastic APM Java agent to OpenTelemetry. A particularly ambitious yet thrilling endeavor is our project to enable dynamic configurability of the OpenTelemetry Java agent. This future enhancement will allow for the OpenTelemetry Agent Management Protocol (OpAMP) to be used to remotely and dynamically configure OTel Java agents, improving their adaptability and ease of use.

We encourage you to experience the new Elastic distribution of the OTel Java agent and share your feedback with us. Your insights are invaluable as we strive to enhance the capabilities and reach of OpenTelemetry, making it even more powerful and user-friendly.

Check out more information on Elastic Distributions of OpenTelemetry (EDOT) on GitHub and in our latest EDOT blog post.

Elastic provides the following components of EDOT:

Elastic's contribution: Invokedynamic in the OpenTelemetry Java agent

Thu, 19 Oct 2023 00:00:00 GMT

As the second-largest active Cloud Native Computing Foundation (CNCF) project, OpenTelemetry is well on its way to becoming the ubiquitous, unified standard and framework for observability. OpenTelemetry owes this success to its comprehensive and feature-rich toolset that allows users to retrieve valuable observability data from their applications with low effort. The OpenTelemetry Java agent is one of the most mature and feature-rich components in OpenTelemetry’s ecosystem. It provides automatic instrumentation for JVM-based applications and comes with broad coverage through auto-instrumentation modules for popular Java frameworks and libraries.

The original instrumentation approach used in the OpenTelemetry Java agent left the maintenance and development of auto-instrumentation modules subject to some restrictions. As part of our reinforced commitment to OpenTelemetry, Elastic® helps evolve and improve OpenTelemetry projects and components. Elastic’s contribution of the Elastic Common Schema to OpenTelemetry was an important step for the open-source community. As another step in our commitment to OpenTelemetry, Elastic started contributing to the OpenTelemetry Java agent.

Elastic’s invokedynamic-based instrumentation approach

To overcome the above-mentioned limitations in developing and maintaining auto-instrumentation modules in the OpenTelemetry Java agent, Elastic started contributing its invokedynamic-based instrumentation approach to the OpenTelemetry Java agent in July 2023.

To explain the improvement, it helps to know that a common approach to auto-instrumenting Java applications is to use Java agents that perform bytecode instrumentation at runtime. Byte Buddy is a popular and widespread utility that helps with bytecode instrumentation without the need to deal with Java’s bytecode directly. Instrumentation logic that collects observability data from the target application’s code lives in so-called advice methods. Byte Buddy provides different ways of hooking these advice methods into the target application’s methods:

Advice inlining: The advice method’s code is being copied into the instrumented target method.
Static advice dispatching: The instrumented target method invokes static advice methods that need to be visible by the instrumented code.
Advice dispatching with invokedynamic: The instrumented target method uses the JVM’s invokedynamic bytecode instruction to call advice methods that are isolated from the instrumented code.

These different approaches are described in great detail in our related blog post on Elastic’s Java APM agent using invokedynamic. In a nutshell, the first two approaches, advice inlining and dispatching to static advice methods, come with some limitations with respect to writing and maintaining the advice code. So far, the OpenTelemetry Java agent has used advice inlining for its bytecode instrumentation. The resulting limitations on developing instrumentations are documented in the corresponding developer guidelines. Among other things, not being able to debug advice code is a painful restriction when developing and maintaining instrumentation code.
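To make the dispatching idea more concrete, here is a minimal sketch that imitates invokedynamic-style linking with the JDK's java.lang.invoke API. It is not code from the OpenTelemetry or Elastic agents: the class names, the TimingAdvice advice, and the bootstrap helper are hypothetical, and real agents emit an actual invokedynamic instruction through Byte Buddy and load the advice in an isolated class loader.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.ConstantCallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class IndyDispatchSketch {

    // Hypothetical advice class. In a real agent it would be loaded by an
    // isolated class loader that the instrumented code cannot see.
    static class TimingAdvice {
        static String onEnter(String methodName) {
            return "enter:" + methodName;
        }
    }

    // Plays the role of an invokedynamic bootstrap method: it resolves the
    // advice method by name and returns a call site linked to it.
    static CallSite bootstrap(MethodHandles.Lookup lookup, String name, MethodType type)
            throws ReflectiveOperationException {
        MethodHandle advice = lookup.findStatic(TimingAdvice.class, name, type);
        return new ConstantCallSite(advice);
    }

    // Stand-in for an instrumented method: it reaches the advice only through
    // the call site. A real invokedynamic instruction runs the bootstrap once
    // per call site; this sketch re-links on every call for brevity.
    static String instrumentedCall(String methodName) {
        try {
            CallSite site = bootstrap(MethodHandles.lookup(), "onEnter",
                    MethodType.methodType(String.class, String.class));
            return (String) site.dynamicInvoker().invokeExact(methodName);
        } catch (Throwable t) {
            throw new IllegalStateException("advice dispatch failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(instrumentedCall("handleRequest")); // prints "enter:handleRequest"
    }
}
```

Because the advice stays an ordinary method that is called rather than copied into the target, it can be developed and debugged like regular Java code, which is the maintainability benefit discussed here.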

Elastic’s APM Java agent has been using the invokedynamic approach, with its benefits, for years, field-proven by thousands of customers. To help improve the OpenTelemetry Java agent, Elastic started contributing the invokedynamic approach with the goal of simplifying and improving the development and maintainability of auto-instrumentation modules. The contribution proposal and the implementation outline are documented in more detail in this GitHub issue.

With the new approach in place, Elastic will help migrate existing instrumentations so the OTel Java community can benefit from the invokedynamic-based instrumentation approach.

Elastic supports OTel natively, and has numerous capabilities to help you analyze your application with OTel.

Native OpenTelemetry support in Elastic Observability

Best Practices for instrumenting OpenTelemetry

Independence with OpenTelemetry on Elastic

Instrumenting with OpenTelemetry:

Elastiflix application, a guide to instrument different languages with OpenTelemetry (this is the application the team built to highlight all the languages below)

Python: Auto-instrumentation, Manual instrumentation

Java: Auto-instrumentation, Manual instrumentation

Node.js: Auto-instrumentation, Manual instrumentation

.NET: Auto-instrumentation, Manual instrumentation

Go: Manual instrumentation

Revealing unknowns in your tracing data with inferred spans in OpenTelemetry

Mon, 22 Apr 2024 00:00:00 GMT

In the complex world of microservices and distributed systems, achieving transparency and understanding the intricacies and inefficiencies of service interactions and request flows has become a paramount challenge. Distributed tracing is essential in understanding distributed systems. But distributed tracing, whether manually applied or auto-instrumented, is usually rather coarse-grained. Hence, distributed tracing covers only a limited fraction of the system and can easily miss parts of the system that are the most useful to trace.

Addressing this gap, Elastic developed the concept of inferred spans, a powerful enhancement to traditional instrumentation-based tracing that is available as an extension for the OpenTelemetry Java SDK/Agent. We are in the process of contributing this back to OpenTelemetry. Until then, our extension can be used seamlessly with the existing OpenTelemetry Java SDK (as described below).

Inferred spans are designed to augment the visibility provided by instrumentation-based traces, shedding light on latency sources within the application or libraries that were previously uninstrumented. This feature significantly expands the utility of distributed tracing, allowing for a more comprehensive understanding of system behavior and facilitating a deeper dive into performance optimization.

What is inferred spans?

Inferred spans is an observability technique that combines distributed tracing with profiling techniques to illuminate the darker, unobserved corners of your application — areas where standard instrumentation techniques fall short. The inferred spans feature interweaves information derived from profiling stacktraces with instrumentation-based tracing data, allowing for the generation of new spans based on the insights drawn from profiling data.

This feature proves invaluable when dealing with custom code or third-party libraries that significantly contribute to the request latency but lack built-in or external instrumentation support. Often, identifying or crafting specific instrumentation for these segments can range from challenging to outright unfeasible. Moreover, certain scenarios exist where implementing instrumentation is impractical due to the potential for substantial performance overhead. For instance, instrumenting application locking mechanisms, despite their critical role, is not viable because of their ubiquitous nature and the significant latency overhead the instrumentation can introduce to application requests. Still, ideally, such latency issues would be visible within your distributed traces.

Inferred spans provides deeper visibility into your application’s performance dynamics, including the above-mentioned scenarios.

Inferred spans in action

To demonstrate the inferred spans feature, we will use the Java implementation of the Elastiflix demo application. Elastiflix has an endpoint called favorites that makes some Redis calls and also includes an artificial delay. First, we use the plain OpenTelemetry Java Agent to instrument our application:

java -javaagent:/path/to/otel-javaagent-.jar \
-Dotel.service.name=my-service-name \
-Dotel.exporter.otlp.endpoint=https:// \
"-Dotel.exporter.otlp.headers=Authorization=Bearer SECRETTOKENHERE" \
-jar my-service-name.jar

With the OpenTelemetry Java Agent we get out-of-the-box instrumentation for HTTP entry points and calls to Redis for our Elastiflix application. The resulting traces contain spans for the POST /favorites entrypoint, as well as a few short spans for the calls to Redis.

As you can see in the trace above, it’s not clear where most of the time is spent within the POST /favorites request.

Let’s see how inferred spans can shed light on these areas. You can use the inferred spans feature manually with your OpenTelemetry SDK (see section below), package it as a drop-in extension for the upstream OpenTelemetry Java agent, or just use Elastic’s distribution of the OpenTelemetry Java agent, which ships with the inferred spans feature.

For convenience, we just download the agent jar of the Elastic distribution and extend the configuration to enable the inferred spans feature:

java -javaagent:/path/to/elastic-otel-javaagent-.jar \
-Dotel.service.name=my-service-name \
-Dotel.exporter.otlp.endpoint=https://XX.apm.europe-west3.gcp.cloud.es.io:443 \
"-Dotel.exporter.otlp.headers=Authorization=Bearer SECRETTOKENHERE" \
-Delastic.otel.inferred.spans.enabled=true \
-jar my-service-name.jar

The only non-standard option here is elastic.otel.inferred.spans.enabled. The inferred spans feature is currently opt-in and therefore needs to be enabled explicitly. Running the same application with the inferred spans feature enabled yields more comprehensive traces:

The inferred spans (colored blue in the above screenshot) follow the naming pattern Class#method. With that, the inferred spans feature helps us pinpoint the exact methods that contribute the most to the overall latency of the request. Note that the parent-child relationship between the HTTP entry span, the Redis spans, and the inferred spans is reconstructed correctly, resulting in a fully functional trace structure.

Examining the handleDelay method within the Elastiflix application reveals the use of a straightforward sleep statement. Although the sleep method is not CPU-bound, the full duration of this delay is captured as inferred spans. This stems from employing the async-profiler's wall clock time profiling, as opposed to solely relying on CPU profiling. The ability of the inferred spans feature to reflect actual latency, including for I/O operations and other non-CPU-bound tasks, represents a significant advancement. It allows for diagnosing and resolving performance issues that extend beyond CPU limitations, offering a more nuanced view of system behavior.
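The difference between wall clock time and CPU time for a sleeping thread can be checked with a few lines of plain JDK code. The following is a minimal sketch, not part of Elastiflix or the inferred spans extension; the class and method names are made up for illustration, and it assumes a JVM that supports per-thread CPU time measurement.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

class WallClockVsCpuSketch {

    /** Returns {wallNanos, cpuNanos} for a sleep of the given length on the current thread. */
    static long[] measureSleep(long sleepMillis) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = threads.isCurrentThreadCpuTimeSupported();
        long wallStart = System.nanoTime();
        long cpuStart = cpuSupported ? threads.getCurrentThreadCpuTime() : 0L;
        try {
            Thread.sleep(sleepMillis); // the thread is parked and consumes almost no CPU
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long cpuNanos = cpuSupported ? threads.getCurrentThreadCpuTime() - cpuStart : 0L;
        long wallNanos = System.nanoTime() - wallStart;
        return new long[] {wallNanos, cpuNanos};
    }

    public static void main(String[] args) {
        long[] result = measureSleep(200);
        System.out.printf("wall clock: %d ms, CPU: %d ms%n",
                result[0] / 1_000_000, result[1] / 1_000_000);
    }
}
```

A profiler that only counts CPU time would attribute close to nothing to such a method, while wall clock sampling sees the thread in the same frame for the whole delay.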

Using inferred spans with your own OpenTelemetry SDK

OpenTelemetry is a highly extensible framework. Elastic embraces this extensibility by also publishing most extensions shipped with our OpenTelemetry Java Distro as standalone extensions to the OpenTelemetry Java SDK.

As a result, if you do not want to use our distro (e.g., because you don’t need or want bytecode instrumentation in your project), you can still use our extensions, such as the extension for the inferred spans feature. All you need to do is set up the OpenTelemetry SDK in your code and add the inferred spans extension as a dependency:


<dependency>
    <groupId>co.elastic.otel</groupId>
    <artifactId>inferred-spans</artifactId>
    <version>{latest version}</version>
</dependency>

During your SDK setup, you’ll have to initialize and register the extension:

InferredSpansProcessor inferredSpans = InferredSpansProcessor.builder()
  .samplingInterval(Duration.ofMillis(10)) //the builder offers all config options
  .build();
SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
  .addSpanProcessor(inferredSpans)
  .addSpanProcessor(BatchSpanProcessor.builder(OtlpGrpcSpanExporter.builder()
    .setEndpoint("https://")
    .addHeader("Authorization", "Bearer ")
    .build()).build())
  .build();
inferredSpans.setTracerProvider(tracerProvider);

The inferred spans extension seamlessly integrates with the OpenTelemetry SDK Autoconfiguration mechanism. By incorporating the OpenTelemetry SDK and its extensions as dependencies within your application code — rather than through an external agent — you gain the flexibility to configure them using the same environment variables or JVM properties. Once the inferred spans extension is included in your classpath, activating it for autoconfigured SDKs becomes straightforward. Simply enable it using the elastic.otel.inferred.spans.enabled property, as previously described, to leverage the full capabilities of this feature with minimal setup.

How does inferred spans work?

The inferred spans feature builds on the wall clock time profiling of async-profiler, a popular, low-overhead production-time profiler in the Java ecosystem. It then transforms the profiling data into actionable spans as part of the distributed traces. But what mechanism allows for this transformation?

Essentially, the inferred spans extension engages with the lifecycle of span events, specifically when a span is either activated or deactivated across any thread via the OpenTelemetry context. Upon the activation of the initial span within a transaction, the extension commences a session of wall-clock profiling via the async-profiler, set to a predetermined duration. Concurrently, it logs the details of all span activations and deactivations, capturing their respective timestamps and the threads on which they occurred.

Following the completion of the profiling session, the extension processes the profiling data alongside the log of span events. By correlating the data, it reconstructs the inferred spans. It's important to note that, in certain complex scenarios, the correlation may assign an incorrect name to a span. To mitigate this and aid in accurate identification, the extension enriches the inferred spans with stacktrace segments under the code.stacktrace attribute, offering users clarity and insight into the precise methods implicated.
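The correlation step can be illustrated with a heavily simplified sketch. It is not the extension's actual algorithm: it handles a single thread and a single activation window, treats a frame that appears in consecutive samples as one continuous call, and uses hypothetical class, method, and frame names throughout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class InferredSpansSketch {

    /** One wall-clock stack sample: a timestamp plus frames ordered from root to leaf. */
    static final class Sample {
        final long timestampMs;
        final List<String> frames;

        Sample(long timestampMs, String... frames) {
            this.timestampMs = timestampMs;
            this.frames = Arrays.asList(frames);
        }
    }

    /** A span reconstructed from samples; depth is its nesting level below the real span. */
    static final class InferredSpan {
        final String name;
        final long startMs;
        final long endMs;
        final int depth;

        InferredSpan(String name, long startMs, long endMs, int depth) {
            this.name = name;
            this.startMs = startMs;
            this.endMs = endMs;
            this.depth = depth;
        }
    }

    /** Turns samples taken while a real span was active into inferred child spans. */
    static List<InferredSpan> infer(List<Sample> samples, long activatedMs, long deactivatedMs) {
        List<InferredSpan> result = new ArrayList<>();
        List<String> openFrames = new ArrayList<>(); // frames of the previous sample
        List<Long> openStarts = new ArrayList<>();   // first sample in which each frame was seen
        long lastSeenMs = activatedMs;
        for (Sample sample : samples) {
            if (sample.timestampMs < activatedMs || sample.timestampMs > deactivatedMs) {
                continue; // the real span was not active on the thread at this point
            }
            // Frames shared with the previous sample are treated as still running.
            int common = 0;
            while (common < openFrames.size() && common < sample.frames.size()
                    && openFrames.get(common).equals(sample.frames.get(common))) {
                common++;
            }
            closeFrames(openFrames, openStarts, common, lastSeenMs, result);
            for (int depth = common; depth < sample.frames.size(); depth++) {
                openFrames.add(sample.frames.get(depth));
                openStarts.add(sample.timestampMs);
            }
            lastSeenMs = sample.timestampMs;
        }
        closeFrames(openFrames, openStarts, 0, lastSeenMs, result);
        return result;
    }

    /** Ends every open frame at depth >= keepDepth, deepest first. */
    private static void closeFrames(List<String> openFrames, List<Long> openStarts,
                                    int keepDepth, long endMs, List<InferredSpan> out) {
        for (int depth = openFrames.size() - 1; depth >= keepDepth; depth--) {
            String name = openFrames.remove(depth);
            long startMs = openStarts.remove(depth);
            if (endMs > startMs) { // a frame seen in a single sample has no measurable duration
                out.add(new InferredSpan(name, startMs, endMs, depth));
            }
        }
    }

    /** Hypothetical samples at a 10 ms interval for a request that mostly sleeps. */
    static List<InferredSpan> demo() {
        List<Sample> samples = Arrays.asList(
                new Sample(0, "Controller#favorites", "Controller#handleDelay", "Thread#sleep"),
                new Sample(10, "Controller#favorites", "Controller#handleDelay", "Thread#sleep"),
                new Sample(20, "Controller#favorites", "Controller#handleDelay", "Thread#sleep"),
                new Sample(30, "Controller#favorites", "Jedis#get"));
        return infer(samples, 0, 40);
    }
}
```

In the demo data, the sleep and handleDelay frames become inferred spans covering 0 to 20 ms, while the frame seen in only one sample is dropped. The sketch also shows why naming can go wrong: two back-to-back calls of the same method would be merged into one span, since samples alone cannot tell them apart.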

Inferred spans vs. correlation of traces with profiling data

In the wake of OpenTelemetry's recent announcement of the profiling signal, coupled with Elastic's commitment to donating the Universal Profiling Agent to OpenTelemetry, you might be wondering how the inferred spans feature differs from merely correlating profiling data with distributed traces using span IDs and trace IDs. Rather than viewing these as competing functionalities, it's more accurate to consider them complementary.

The inferred spans feature and the correlation of tracing with profiling data both employ similar methodologies — melding tracing information with profiling data. However, they each shine in distinct areas. Inferred spans excels at identifying long-running methods that could escape notice with traditional CPU profiling, which is more adept at pinpointing CPU bottlenecks. A unique advantage of inferred spans is its ability to account for I/O time, capturing delays caused by operations like disk access that wouldn't typically be visible in CPU profiling flamegraphs.

However, the inferred spans feature has its limitations, notably in detecting latency issues arising from "death by a thousand cuts" — where a method, although not time-consuming per invocation, significantly impacts total latency due to being called numerous times across a request. While individual calls might not be captured as inferred spans due to their brevity, CPU-bound methods contributing to latency are unveiled through CPU profiling, as flamegraphs display the aggregate CPU time consumed by these methods.
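A small simulation makes the "death by a thousand cuts" effect tangible. The numbers below are invented for illustration (a 10 ms sampling interval, a 0.2 ms method called 1,000 times, one call every 0.7 ms), and the model assumes perfectly regular timing, which real applications do not have.

```java
class ThousandCutsSketch {
    static final long INTERVAL_US = 10_000; // 10 ms sampling interval
    static final long CALL_US = 200;        // each call of the cheap method takes 0.2 ms
    static final long PERIOD_US = 700;      // a new call starts every 0.7 ms
    static final int CALLS = 1_000;

    /** Index of the invocation running at time t, or -1 if other code is running. */
    static long invocationAt(long tUs) {
        return (tUs % PERIOD_US) < CALL_US ? tUs / PERIOD_US : -1;
    }

    /** Number of samples that land inside the cheap method (what a flamegraph aggregates). */
    static int samplesInMethod() {
        int hits = 0;
        for (long t = 0; t < CALLS * PERIOD_US; t += INTERVAL_US) {
            if (invocationAt(t) >= 0) {
                hits++;
            }
        }
        return hits;
    }

    /** Number of invocations seen in two consecutive samples (what span inference needs). */
    static int invocationsSpanningTwoSamples() {
        int count = 0;
        long previous = -1;
        for (long t = 0; t < CALLS * PERIOD_US; t += INTERVAL_US) {
            long current = invocationAt(t);
            if (current >= 0 && current == previous) {
                count++;
            }
            previous = current;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("samples in method: " + samplesInMethod());
        System.out.println("invocations spanning two samples: " + invocationsSpanningTwoSamples());
    }
}
```

With these assumed numbers, 20 of the 70 samples fall into the method, matching its roughly 29% share of the time, yet no single invocation is ever seen twice, so no inferred span could be created for it.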

An additional strength of the inferred spans feature lies in its data structure, offering a simplified tracing model that outlines typical parent-child relationships, execution order, and good latency estimates. This structure is achieved by integrating tracing data with span activation/deactivation events and profiling data, facilitating straightforward navigation and troubleshooting of latency issues within individual traces.

Correlating distributed tracing data with profiling data comes with a different set of advantages. Learn more about it in our related blog post, Beyond the trace: Pinpointing performance culprits with continuous profiling and distributed tracing correlation.

What about the performance overhead?

As mentioned before, the inferred spans functionality is based on the widely used async-profiler, known for its minimal impact on performance. However, the actual overhead depends largely on the configuration. A pivotal factor is the sampling interval: the longer the interval between samples, the lower the incurred overhead, albeit at the expense of potentially overlooking shorter methods that the inferred spans feature would otherwise discover.

Adjusting the probability-based trace sampling offers another lever for optimization, directly influencing the overhead. For instance, setting trace sampling to 50% effectively halves the profiling load, making the inferred spans feature even more resource-efficient on average per request. This nuanced approach to tuning ensures that the inferred spans feature can be leveraged in real-world, production environments with a manageable performance footprint. When properly configured, this feature offers a potent, low-overhead solution for enhancing observability and diagnostic capabilities within production applications.
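As a rough back-of-the-envelope aid, the two tuning knobs can be combined in a simple linear model. This is an assumption made for illustration, not a measurement or a formula from the extension: it presumes that only sampled traces are profiled and that the cost scales with the number of stack samples taken.

```java
class ProfilingLoadSketch {

    /** Stack samples taken per second on one profiled thread for a given sampling interval. */
    static double samplesPerSecond(long samplingIntervalMs) {
        return 1000.0 / samplingIntervalMs;
    }

    /** Samples per second per thread, averaged over all requests (assumed linear model). */
    static double averageSamplesPerSecond(long samplingIntervalMs, double traceSamplingRatio) {
        return samplesPerSecond(samplingIntervalMs) * traceSamplingRatio;
    }

    public static void main(String[] args) {
        System.out.println(averageSamplesPerSecond(10, 1.0)); // 10 ms interval, all traces
        System.out.println(averageSamplesPerSecond(10, 0.5)); // 10 ms interval, 50% of traces
        System.out.println(averageSamplesPerSecond(20, 1.0)); // 20 ms interval, all traces
    }
}
```

Under this model, halving the trace sampling ratio and doubling the sampling interval reduce the average load by the same factor, but only the longer interval makes short methods harder to detect.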

What’s next for inferred spans and OpenTelemetry?

This blog post outlined and introduced the inferred spans feature available as an extension for the OpenTelemetry Java SDK and built into the newly introduced Elastic OpenTelemetry Java Distro. Inferred spans allows users to troubleshoot latency issues in areas of code that are not explicitly instrumented while utilizing traditional tracing data.

The feature is currently merely a port of the existing feature from the proprietary Elastic APM Agent. With Elastic embracing OpenTelemetry, we plan on contributing this extension to the upstream OpenTelemetry project. For that, we also plan on migrating the extension to the latest async-profiler 3.x release. Try out inferred spans for yourself and see how it can help you diagnose performance problems in your applications.

Combining Elastic Universal Profiling with Java APM Services and Traces

Thu, 20 Jun 2024 00:00:00 GMT

In a previous blog post, we introduced the technical details of how we managed to correlate eBPF profiling data with APM traces. This time, we'll show you how to get this feature up and running to pinpoint CPU bottlenecks in your Java services! The correlation is supported for both OpenTelemetry and the classic Elastic APM Agent. We'll show you how to enable it for both.

Demo Application

For this blog post, we’ll be using the cpu-burner demo application to showcase the correlation capabilities of APM, tracing, and profiling in Elastic. This application was built to continuously execute several CPU-intensive tasks:

It computes Fibonacci numbers using the naive, recursive algorithm.
It hashes random data with the SHA-2 and SHA-3 hashing algorithms.
It performs numerous large background allocations to stress the garbage collector.

The computations of the Fibonacci numbers and the hashing will each be visible as transactions in Elastic: They have been manually instrumented using the OpenTelemetry API.
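For readers who have not looked at the demo repository, the first two kinds of work can be sketched as follows. This is not the cpu-burner source code: the class and method names are hypothetical, the OpenTelemetry instrumentation is omitted, and java.util.Random is only an assumed choice of random number generator.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Random;

class CpuBurnerSketch {

    /** Naive recursion: exponential time, because subproblems are recomputed over and over. */
    static long fibonacci(int n) {
        return n < 2 ? n : fibonacci(n - 1) + fibonacci(n - 2);
    }

    /** Fills a buffer with random bytes and hashes it with the named algorithm. */
    static byte[] hashRandomData(String algorithm, int sizeBytes) {
        try {
            MessageDigest digest = MessageDigest.getInstance(algorithm);
            byte[] data = new byte[sizeBytes];
            new Random().nextBytes(data); // generating the input also costs CPU time
            return digest.digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown algorithm: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fibonacci(30));
        System.out.println(hashRandomData("SHA-256", 1024).length + " byte digest");
    }
}
```

Both methods are purely CPU-bound, which is what makes this kind of workload a good fit for demonstrating CPU profiling.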

Setting up Profiling and APM

First, we’ll need to set up the universal profiling host agent on the host where the demo application will run. Starting from version 8.14.0, correlation with APM data is supported and enabled out of the box for the profiler. There is no special configuration needed; we can just follow the standard setup guide. Note that at the time of writing, universal profiling only supports Linux. On Windows, you'll have to use a VM to try the demo. On macOS, you can use Colima as the Docker engine and run the profiling host agent and the demo app in containers.

In addition, we’ll need to instrument our demo application with an APM agent. We can either use the classic Elastic APM agent or the Elastic OpenTelemetry Distribution.

Using the Classic Elastic APM Agent

Starting with version 1.50.0, the classic Elastic APM agent ships with the capability to correlate the traces it captures with the profiling data from universal profiling. We’ll just need to enable it explicitly via the universal_profiling_integration_enabled config option. Here is the standard command line for running the demo application with the setting enabled:

curl -o 'elastic-apm-agent.jar' -L 'https://oss.sonatype.org/service/local/artifact/maven/redirect?r=releases&g=co.elastic.apm&a=elastic-apm-agent&v=LATEST'
java -javaagent:elastic-apm-agent.jar \
-Delastic.apm.service_name=cpu-burner-elastic \
-Delastic.apm.secret_token=XXXXX \
-Delastic.apm.server_url= \
-Delastic.apm.application_packages=co.elastic.demo \
-Delastic.apm.universal_profiling_integration_enabled=true \
-jar ./target/cpu-burner.jar

Using OpenTelemetry

The feature is also available as an OpenTelemetry SDK extension. This means you can use it as a plugin for the vanilla OpenTelemetry agent or add it to your OpenTelemetry SDK if you are not using an agent. In addition, the feature ships by default with the Elastic OpenTelemetry Distribution for Java and can be used via any of the possible usage methods. While the extension is currently Elastic-specific, we are already working with the various OpenTelemetry SIGs on standardizing the correlation mechanism, especially now that the eBPF profiling agent has been contributed.

For this demo, we’ll be using the Elastic OpenTelemetry Distro Java agent to run the extension:

curl -o 'elastic-otel-javaagent.jar' -L 'https://oss.sonatype.org/service/local/artifact/maven/redirect?r=releases&g=co.elastic.otel&a=elastic-otel-javaagent&v=LATEST'
java -javaagent:./elastic-otel-javaagent.jar \
-Dotel.exporter.otlp.endpoint= \
"-Dotel.exporter.otlp.headers=Authorization=Bearer XXXX" \
-Dotel.service.name=cpu-burner-otel \
-Delastic.otel.universal.profiling.integration.enabled=true \
-jar ./target/cpu-burner.jar

Here, we explicitly enabled the profiling integration feature via the elastic.otel.universal.profiling.integration.enabled property. Note that with an upcoming release of the universal profiling feature, this won’t be necessary anymore! The OpenTelemetry extension will then automatically detect the presence of the profiler and enable the correlation feature based on that.

The demo repository also comes with a Dockerfile, so you can alternatively build and run the app in Docker:

docker build -t cpu-burner .
docker run --rm -e OTEL_EXPORTER_OTLP_ENDPOINT= -e OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer XXXX" cpu-burner

And that’s it for setup; we are now ready to inspect the correlated profiling data!

Analyzing Service CPU Usage

The first thing we can do now is head to the “Flamegraph” view in Universal Profiling and inspect flamegraphs filtered on APM services. Without the APM correlation, universal profiling is limited to filtering on infrastructure concepts, such as hosts, containers, and processes. Below is a screencast showing a flamegraph filtered on the service name of our demo application:

With this filter applied, we get a flamegraph aggregated over all instances of our service. If that is not desired, we could narrow down the filter, e.g. based on the host or container names. Note that the same service-level flamegraph view is also available on the “Universal Profiling” tab in the APM service UI.

The flamegraphs show exactly how the demo application is spending its CPU time, independently of whether it is covered by instrumentation or not. From left to right, we can first see the time spent in application tasks: We can identify the background allocations not covered by APM transactions as well as the SHA-computation and Fibonacci transactions. Interestingly, this application logic only covers roughly 60% of the total CPU time! The remaining time is spent mostly in the G1 garbage collector due to the high allocation rate of our application. The flamegraph shows all G1-related activities and the timing of the individual phases of concurrent tasks. We can easily identify those based on the native function names. This is made possible by universal profiling being capable of profiling and symbolizing the JVM’s C++ code in addition to the Java code.

Pinpointing Transaction Bottlenecks

While the service-level flamegraph already gives good insights on where our transactions consume the most CPU, this is mainly due to the simplicity of the demo application. In real-world applications, it can be much harder to pinpoint that certain stack frames come mostly from certain transactions. For this reason, the APM agent also correlates CPU profiling data from universal profiling on the transaction level.

We can navigate to the “Universal Profiling” tab on the transaction details page to get per-transaction flamegraphs:

For example, let’s have a look at the flamegraph of our transaction computing SHA-2 and SHA-3 hashes of randomly generated data:

Interestingly, the flamegraph uncovers some unexpected results: The transactions spend more time computing the random bytes to be hashed than on the hashing itself! So if this were a real-world application, a possible optimization could be to use a more performant random number generator.

In addition, we can see that the MessageDigest.update call for computing the hash values fans out into two different code paths: One is a call into the BouncyCastle cryptography library, the other one is a JVM stub routine, meaning that the JIT compiler has inserted special assembly code for a function.

The flamegraph shown in the screenshot displays the aggregated data for all “shaShenanigans” transactions in the given time filter. We can further filter this down using the transaction filter bar at the top. To make the best use of this, the demo application annotates the transactions with the hashing algorithm used via OpenTelemetry attributes:

public static void shaShenanigans(MessageDigest digest) {
    Span span = tracer.spanBuilder("shaShenanigans")
        .setAttribute("algorithm", digest.getAlgorithm())
        .startSpan();
    ...
    span.end();
}

So, let’s filter our flamegraph based on the used hashing algorithm:

Note that “SHA-256” is the name of the JVM built-in SHA-2 256-bit implementation. This now gives the following flamegraph:

We can see that the BouncyCastle stack frames are gone and MessageDigest.update spends all its time in the JVM stub routines. Therefore, the stub routine is likely hand-crafted assembly from the JVM maintainers for the SHA-2 algorithm.

If we instead filter on “SHA3-256”, we get the following result:

Now, as expected, MessageDigest.update spends all its time in the BouncyCastle library for the SHA-3 implementation. Note that the hashing here takes up more time in relation to the random data generation, showing that the SHA-2 JVM stub routine is significantly faster than the BouncyCastle Java SHA-3 implementation.

This filtering is not limited to custom attributes like those shown in this demo. You can filter on any transaction attributes, including latency, HTTP headers, and so on. For example, for typical HTTP applications, it allows analyzing the efficiency of the JSON serializer in use based on the payload size. Note that while it is possible to filter on single transaction instances (e.g. based on trace.id), this is not recommended: To allow continuous profiling in production systems, the profiler by default runs with a low sampling rate of 20 Hz. For typical real-world applications, a single transaction execution will therefore not yield enough data. Instead, we gain insights by monitoring multiple executions of a group of transactions over time and aggregating their samples, for example in a flamegraph.
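A quick estimate shows why a single execution is rarely enough. The sketch below assumes that the expected number of samples is simply CPU time multiplied by the 20 Hz sampling rate; the class name and the example CPU times are made up for illustration.

```java
class SampleBudgetSketch {
    static final double SAMPLING_HZ = 20.0; // default sampling rate mentioned above

    /** Expected number of stack samples for transactions that each use the given CPU time. */
    static double expectedSamples(double cpuMillisPerTransaction, long transactions) {
        return cpuMillisPerTransaction * SAMPLING_HZ * transactions / 1000.0;
    }

    public static void main(String[] args) {
        System.out.println(expectedSamples(100, 1));     // one transaction using 100 ms of CPU
        System.out.println(expectedSamples(100, 1_000)); // a thousand such transactions
    }
}
```

Under that assumption, one transaction that burns 100 ms of CPU contributes about two samples, which is far too few for a meaningful flamegraph, whereas a thousand executions of the same transaction add up to around two thousand samples.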

Summary

A common reason for applications to degrade is overly high CPU usage. In this blog post, we showed how to combine universal profiling with APM to find the actual root cause in such cases: We explained how to analyze the CPU time using profiling flamegraphs on service and transaction levels. In addition, we further drilled down into data using custom filters. We used a simple demo application for this purpose, so go ahead and try it yourself with your own, real-world applications to uncover the actual power of the feature!