<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Elastic Observability Labs - Articles by Israel Ogbole</title>
        <link>https://www.elastic.co/observability-labs</link>
        <description>Trusted observability news &amp; research from the team at Elastic.</description>
        <lastBuildDate>Wed, 22 Apr 2026 15:41:03 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>Elastic Observability Labs - Articles by Israel Ogbole</title>
            <url>https://www.elastic.co/observability-labs/assets/observability-labs-thumbnail.png</url>
            <link>https://www.elastic.co/observability-labs</link>
        </image>
        <copyright>© 2026. Elasticsearch B.V. All Rights Reserved</copyright>
        <item>
            <title><![CDATA[Beyond the trace: Pinpointing performance culprits with continuous profiling and distributed tracing correlation]]></title>
            <link>https://www.elastic.co/observability-labs/blog/continuous-profiling-distributed-tracing-correlation</link>
            <guid isPermaLink="false">continuous-profiling-distributed-tracing-correlation</guid>
            <pubDate>Thu, 28 Mar 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Frustrated by slow traces but unsure where the code bottleneck lies? Elastic Universal Profiling correlates profiling stacktraces with OpenTelemetry (OTel) traces, helping you identify and pinpoint the exact lines of code causing performance issues.]]></description>
            <content:encoded><![CDATA[<p>Observability goes beyond monitoring; it's about truly understanding your system. To achieve this comprehensive view, practitioners need a unified observability solution that natively combines insights from metrics, logs, traces, and crucially, <strong>continuous profiling</strong>. While metrics, logs, and traces offer valuable insights, they can't answer the all-important &quot;why.&quot; Continuous profiling signals act as a magnifying glass, providing granular code visibility into the system's hidden complexities. They fill the gap left by other data sources, enabling you to answer critical questions –– why is this trace slow? Where exactly in the code is the bottleneck residing?</p>
<p>Traces provide the &quot;what&quot; and &quot;where&quot; — what happened and where in your system. Continuous profiling refines this understanding by pinpointing the &quot;why&quot; and validating your hypotheses about the &quot;what.&quot; Just like a full-body MRI scan, Elastic's whole-system continuous profiling (powered by eBPF) uncovers unknown-unknowns in your system. This includes not just your code, but also third-party libraries and kernel activity triggered by your application transactions. This comprehensive visibility improves your mean-time-to-detection (MTTD) and mean-time-to-recovery (MTTR) KPIs.</p>
<p><em>[Related article:</em> <a href="https://www.elastic.co/blog/observability-profiling-metrics-logs-traces"><em>Why metrics, logs, and traces aren’t enough</em></a><em>]</em></p>
<h2>Bridging the disconnect between continuous profiling and OTel traces</h2>
<p>Historically, continuous profiling signals have been largely disconnected from OpenTelemetry (OTel) traces. Here's the exciting news: we're bridging this gap! We're introducing native correlation between continuous profiling signals and OTel traces, starting with Java.</p>
<p>Imagine this: You're troubleshooting a performance issue and identify a slow trace. Whole-system continuous profiling steps in, acting like an MRI scan for your entire codebase and system. It narrows down the culprit to the specific lines of code hogging CPU time within the context of your distributed trace. This empowers you to answer the &quot;why&quot; question with minimal effort and confidence, all within the same troubleshooting context.</p>
<p>Furthermore, by correlating continuous profiling with distributed tracing, Elastic Observability customers can measure the cloud cost and CO<sub>2</sub> impact of every code change at the service and transaction level.</p>
<p>This milestone is significant, especially considering the recent developments in the OTel community. With <a href="https://www.cncf.io/blog/2024/03/19/opentelemetry-announces-support-for-profiling/">OTel adopting profiling</a> and Elastic <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">donating the industry’s most advanced eBPF-based continuous profiling agent to OTel</a>, we're set for a game-changer in observability — empowering OTel end users with correlated system visibility that goes from a trace span in userspace down to the kernel.</p>
<p>However, achieving this goal, especially with Java, presented significant challenges and demanded serious engineering R&amp;D. This blog post will delve into these challenges, explore the approaches we considered in our proof-of-concepts, and explain how we arrived at a solution that can be easily extended to other OTel language agents. Most importantly, this solution correlates traces with profiling signals at the agent, not in the backend — to ensure optimal query performance and minimal reliance on vendor backend storage architectures.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-distributed-tracing-correlation/trace.png" alt="Profiling flamegraph for a specific trace.id" /></p>
<h2>Figuring out the active OTel trace and span</h2>
<p>The primary technical challenge in this endeavor is essentially the following: whenever the profiler interrupts an OTel instrumented process to capture a stacktrace, we need to be able to efficiently determine the active span and trace ID (per-thread) and the service name (per-process).</p>
<p>For the purpose of this blog, we'll focus on the recently released <a href="https://github.com/elastic/elastic-otel-java">Elastic distribution of the OTel Java instrumentation</a>, but the approach that we ended up with generalizes to any language that can load and call into a native library. So, how do we get our hands on those IDs?</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-distributed-tracing-correlation/service-popout.png" alt="Profiling correlated with service.name, showing  CO2 and cloud cost impact by line of code." /></p>
<p>The OTel Java agent itself keeps track of the active span by storing a stack of spans in the <a href="https://opentelemetry.io/docs/concepts/context-propagation/#context">OpenTelemetryContext</a>, which itself is stored in a <a href="https://docs.oracle.com/javase/8/docs/api/java/lang/ThreadLocal.html">ThreadLocal</a> variable. We originally considered reading these Java structures directly from BPF, but we eventually decided against that approach. There is no documented specification on how ThreadLocals are implemented, and reliably reading and following the JVM's internal data-structures would incur a high maintenance burden. Any minor update to the JVM could change details of the structure layouts. To add to this, we would also have to reverse engineer how each JVM version lays out Java class fields in memory, as well as how all the high-level Java types used in the context objects are actually implemented under the hood. Furthermore, this approach wouldn't generalize to non-JVM languages and would need to be repeated for every language that we wish to support.</p>
<p>After we had convinced ourselves that reading Java ThreadLocal directly is not the answer, we decided to look for more portable alternatives instead. The option that we ultimately settled with is to load and call into a C++ library that is responsible for making the required information available via a known and defined interface whenever the span changes.</p>
<p>Unlike Java's ThreadLocals, the details of how a native shared library should expose per-process and per-thread data are well-defined in the System V ABI specification and the architecture-specific ELF ABI documents.</p>
<h2>Exposing per-process information</h2>
<p>Exposing per-process data is easy: we simply declare a global variable . . .</p>
<pre><code class="language-java">void* elastic_tracecorr_process_storage_v1 = nullptr;
</code></pre>
<p>. . . and expose it via ELF symbols. When the user initializes the OTel library to set the service name, we allocate a buffer and populate it with data in a <a href="https://github.com/elastic/apm/blob/149cd3e39a77a58002344270ed2ad35357bdd02d/specs/agents/universal-profiling-integration.md#process-storage-layout">protocol that we defined for this purpose</a>. Once the buffer is fully populated, we update the global pointer to point to the buffer.</p>
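<p>To make this concrete, here is a minimal C++ sketch of how such a buffer could be published. The field layout shown is purely illustrative and the helper name is hypothetical; the actual layout is defined in the protocol specification linked above.</p>
<pre><code class="language-cpp">#include &lt;atomic&gt;
#include &lt;cstdint&gt;
#include &lt;cstring&gt;
#include &lt;string&gt;

// Exported via the ELF dynamic symbol table; the profiler resolves this name.
extern "C" {
void* elastic_tracecorr_process_storage_v1 = nullptr;
}

void publishProcessStorage(const std::string&amp; serviceName) {
  // Hypothetical layout: a 2-byte length followed by the UTF-8 service name.
  auto* buf = new uint8_t[2 + serviceName.size()];
  uint16_t len = static_cast&lt;uint16_t&gt;(serviceName.size());
  std::memcpy(buf, &amp;len, sizeof(len));
  std::memcpy(buf + 2, serviceName.data(), serviceName.size());

  // Publish only after the buffer is fully populated, so the profiler can
  // never observe a partially written buffer.
  std::atomic_thread_fence(std::memory_order_release);
  elastic_tracecorr_process_storage_v1 = buf;
}
</code></pre>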
<p>On the profiling agent side, we already have code in place that detects libraries and executables loaded into any process's address space. We normally use this mechanism to detect and analyze high-level language interpreters (e.g., libpython, libjvm) when they are loaded, but it also turned out to be a perfect fit to detect the OTel trace correlation library. When the library is detected in a process, we scan the exports, resolve the symbol, and read the per-process information directly from the instrumented process’ memory.</p>
<h2>Exposing per-thread information</h2>
<p>With the easy part out of the way, let's get to the nitty-gritty portion: exposing per-thread information via thread-local storage (TLS). So, what exactly is TLS, and how does it work? At the most basic level, the idea is to have <strong>one instance of a variable for every thread</strong>. Semantically you can think of it like having a global Map&lt;ThreadID, T&gt;, although that is not how it is implemented.</p>
<p>On Linux, there are two major options for thread locals: TSD and TLS.</p>
<h2>Thread-specific data (TSD)</h2>
<p>TSD is the older and probably more commonly known variant. It works by explicitly allocating a key via pthread_key_create — usually during process startup — and passing it to all threads that require access to the thread-local variable. The threads can then pass that key to the pthread_getspecific and pthread_setspecific functions to read and update the variable for the currently running thread.</p>
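<p>For reference, this is roughly what the TSD approach would look like; a minimal sketch using the pthread API (the function and variable names are illustrative):</p>
<pre><code class="language-cpp">#include &lt;pthread.h&gt;

// Created once, e.g. during library initialization.
static pthread_key_t correlation_key;

void initCorrelationKey() {
  pthread_key_create(&amp;correlation_key, /*destructor=*/nullptr);
}

// Each thread reads and writes its own value under the same key.
void setThreadCorrelationData(void* data) {
  pthread_setspecific(correlation_key, data);
}

void* getThreadCorrelationData() {
  return pthread_getspecific(correlation_key);
}
</code></pre>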
<p>TSD is simple, but for our purposes it has a range of drawbacks:</p>
<ul>
<li>
<p>The pthread_key_t structure is opaque and doesn't have a defined layout. Similar to the Java ThreadLocals, the underlying data-structures aren't defined by the ABI documents and different libc implementations (glibc, musl) will handle them differently.</p>
</li>
<li>
<p>We cannot call a function like pthread_getspecific from BPF, so we'd have to reverse engineer and reimplement the logic. Logic may change between libc versions, and we’d have to detect the version and support all variants that may come up in the wild.</p>
</li>
<li>
<p>TSD performance is not predictable and varies depending on how many thread local variables have been allocated in the process previously. This may not be a huge concern for Java specifically since spans are typically not swapped super rapidly, but it’d likely be quite noticeable for user-mode scheduling languages where the context might need to be swapped at every await point/coroutine yield.</p>
</li>
</ul>
<p>None of this is strictly prohibitive, but a lot of this is annoying at the very least. Let’s see if we can do better!</p>
<h2>Thread-local storage (TLS)</h2>
<p>Starting with C11 and C++11, both languages support thread local variables directly via the _Thread_local and thread_local storage specifiers, respectively. Declaring a variable as per-thread is now a matter of simply adding the keyword:</p>
<pre><code class="language-java">thread_local void* elastic_tracecorr_tls_v1 = nullptr;
</code></pre>
<p>You might assume that the compiler simply inserts calls to the corresponding pthread functions when variables declared this way are accessed, but this is not actually the case. The reality is surprisingly complicated, and it turns out that there are four different models of TLS that the compiler can choose to generate. For some of those models, there are further multiple dialects that can be used to implement them. The different models and dialects come with various portability versus performance trade-offs. If you are interested in the details, I suggest reading this <a href="https://maskray.me/blog/2021-02-14-all-about-thread-local-storage">blog article</a> that does a great job at explaining them.</p>
<p>The TLS model and dialect are usually chosen by the compiler based on a somewhat opaque and complicated set of architecture-specific rules. Fortunately for us, both gcc and clang allow users to pick a particular one using the -ftls-model and -mtls-dialect arguments. The variant that we ended up picking for our purposes is -ftls-model=global-dynamic and -mtls-dialect=gnu2 (and desc on aarch64).</p>
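<p>For illustration, the same choice can be made on the compiler command line or pinned directly in the source with GCC/Clang's tls_model attribute; a small sketch (the variable matches the one declared above):</p>
<pre><code class="language-cpp">// Build with: -ftls-model=global-dynamic -mtls-dialect=gnu2   (x86-64)
//             -ftls-model=global-dynamic -mtls-dialect=desc   (aarch64)
// The attribute pins the TLS model for this variable regardless of the flags.
__attribute__((tls_model("global-dynamic")))
thread_local void* elastic_tracecorr_tls_v1 = nullptr;
</code></pre>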
<p>Let's take a look at the assembly that is being generated when accessing a thread_local variable under these settings. Our function:</p>
<pre><code class="language-java">void setThreadProfilingCorrelationBuffer(JNIEnv* jniEnv, jobject bytebuffer) {
  if (bytebuffer == nullptr) {
    elastic_tracecorr_tls_v1 = nullptr;
  } else {
    elastic_tracecorr_tls_v1 = jniEnv-&gt;GetDirectBufferAddress(bytebuffer);
  }
}
</code></pre>
<p>Is compiled to the following assembly code:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-distributed-tracing-correlation/assembly.png" alt="assembly" /></p>
<p>Both possible branches assign a value to our thread-local variable. Let’s focus on the right branch, corresponding to the nullptr case, to get rid of the noise from the GetDirectBufferAddress function call:</p>
<pre><code class="language-java">lea   rax, elastic_tracecorr_tls_v1_tlsdesc  ;; Load some pointer into rax.
call  qword ptr [rax]                        ;; Read &amp; call function pointer at rax.
mov   qword ptr fs:[rax], 0                  ;; Assign 0 to the pointer returned by
                                             ;; the function that we just called.
</code></pre>
<p>The fs: portion of the mov instruction is the actual magic bit that makes the memory read per-thread. We’ll get to that later; let’s first look at the mysterious elastic_tracecorr_tls_v1_tlsdesc variable that the compiler emitted here. It’s an instance of the tlsdesc structure that is located somewhere in the .got.plt ELF section. The structure looks like this:</p>
<pre><code class="language-java">struct tlsdesc {
  // Function pointer used to retrieve the offset
  uint64_t (*resolver)(tlsdesc*);

  // TLS offset -- more on that later.
  uint64_t tp_offset;
};
</code></pre>
<p>The resolver field is initialized with nullptr and tp_offset with a per-executable offset. The first thread-local variable in an executable will usually have offset 0, the next one sizeof(first_var), and so on. At first glance this may appear to be similar to how TSD works, with the call to pthread_getspecific to resolve the actual offset, but there is a crucial difference. When the library is loaded, the resolver field is filled in with the address of __tls_get_addr by the loader (ld.so). __tls_get_addr is a relatively heavy function that allocates a TLS offset that is globally unique between all shared libraries in the process. It then proceeds by updating the tlsdesc structure itself, inserting the global offset and replacing the resolver function with a trivial one:</p>
<pre><code class="language-java">void* second_stage_resolver(tlsdesc* desc) {
  return tlsdesc-&gt;tp_offset;
}
</code></pre>
<p>In essence, this means that the first access to a tlsdesc based thread-local variable is rather expensive, but all subsequent ones are cheap. We further know that by the time that our C++ library starts publishing per-thread data, it must have gone through the initial resolving process already. Consequently, all that we need to do is to read the final offset from the process's memory and memorize it. We also refresh the offset every now and then to ensure that we really have the final offset, combating the unlikely but possible race condition that we read the offset before it was initialized. We can detect this case by comparing the resolver address against the address of the __tls_get_addr function exported by ld.so.</p>
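<p>A minimal sketch of how such a periodic re-read might look on the profiler side, assuming we already know the descriptor's address in the target process and the address of __tls_get_addr (names and error handling simplified):</p>
<pre><code class="language-cpp">#include &lt;sys/types.h&gt;
#include &lt;sys/uio.h&gt;
#include &lt;cstdint&gt;
#include &lt;optional&gt;

struct tlsdesc {
  uint64_t resolver;   // function pointer, read here as a plain integer
  uint64_t tp_offset;  // final TLS offset once resolved
};

// Only accept the offset once the lazy resolver has been replaced, i.e. it no
// longer points at ld.so's __tls_get_addr (and is no longer nullptr).
std::optional&lt;uint64_t&gt; readFinalTlsOffset(pid_t pid, uint64_t tlsdesc_addr,
                                           uint64_t tls_get_addr_addr) {
  tlsdesc desc{};
  iovec local{&amp;desc, sizeof(desc)};
  iovec remote{reinterpret_cast&lt;void*&gt;(tlsdesc_addr), sizeof(desc)};
  if (process_vm_readv(pid, &amp;local, 1, &amp;remote, 1, 0) !=
      static_cast&lt;ssize_t&gt;(sizeof(desc)))
    return std::nullopt;
  if (desc.resolver == 0 || desc.resolver == tls_get_addr_addr)
    return std::nullopt;  // first access hasn't happened yet; retry later
  return desc.tp_offset;
}
</code></pre>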
<h2>Determining the TLS offset from an external process</h2>
<p>With that out of the way, the next question that arises is how to actually find the tlsdesc in memory so that we can read the offset. Intuitively one might expect that the dynamic symbol exported in the ELF file points to that descriptor, but that is not actually the case.</p>
<pre><code class="language-bash">$ readelf --wide --dyn-syms elastic-jvmti-linux-x64.so | grep elastic_tracecorr_tls_v1
328: 0000000000000000 	8 TLS 	GLOBAL DEFAULT   19 elastic_tracecorr_tls_v1
</code></pre>
<p>The dynamic symbol instead contains an offset relative to the start of the ELF TLS segment and points to the initial value that libc initializes the TLS value with when it is allocated. So how does ld.so find the tlsdesc to fill in the initial resolver? In addition to the dynamic symbol, the compiler also emits a relocation record for our symbol, and that one actually points to the descriptor structure that we are looking for.</p>
<pre><code class="language-bash">$ readelf --relocs --wide elastic-jvmti-linux-x64.so | grep R_X86_64_TLSDESC
00000000000426e8  0000014800000024 R_X86_64_TLSDESC   	0000000000000000 elastic_tracecorr_tls_v1 + 0
</code></pre>
<p>To read the final TLS offset, we thus simply have to:</p>
<ul>
<li>
<p>Wait for the event notifying us about a new shared library being loaded into a process</p>
</li>
<li>
<p>Do some cheap heuristics to detect our C++ library, avoiding the more expensive analysis below from being executed for every unrelated library on the system</p>
</li>
<li>
<p>Analyze the library on disk and scan ELF relocations for our per-thread variable to extract the tlsdesc address (see the sketch after this list)</p>
</li>
<li>
<p>Rebase that address to match where our library was loaded in that particular process</p>
</li>
<li>
<p>Read the offset from tlsdesc+8</p>
</li>
</ul>
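<p>Assuming an x86-64 ELF shared library, the relocation scan from the third step might look roughly like the following sketch (whole file read into memory, no error handling; the helper name is illustrative):</p>
<pre><code class="language-cpp">#include &lt;elf.h&gt;
#include &lt;cstdint&gt;
#include &lt;fstream&gt;
#include &lt;iterator&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

// Returns the (unrebased) virtual address of the tlsdesc for the given symbol,
// i.e. the programmatic equivalent of the readelf invocation shown above.
uint64_t findTlsdescAddress(const std::string&amp; path, const std::string&amp; symbol) {
  std::ifstream f(path, std::ios::binary);
  std::vector&lt;char&gt; data((std::istreambuf_iterator&lt;char&gt;(f)),
                         std::istreambuf_iterator&lt;char&gt;());

  auto* ehdr = reinterpret_cast&lt;Elf64_Ehdr*&gt;(data.data());
  auto* shdrs = reinterpret_cast&lt;Elf64_Shdr*&gt;(data.data() + ehdr-&gt;e_shoff);

  for (int i = 0; i &lt; ehdr-&gt;e_shnum; ++i) {
    if (shdrs[i].sh_type != SHT_RELA) continue;

    // A relocation section links to its symbol table, which in turn links to
    // the string table holding the symbol names.
    auto* syms = reinterpret_cast&lt;Elf64_Sym*&gt;(
        data.data() + shdrs[shdrs[i].sh_link].sh_offset);
    const char* strtab =
        data.data() + shdrs[shdrs[shdrs[i].sh_link].sh_link].sh_offset;

    auto* relas = reinterpret_cast&lt;Elf64_Rela*&gt;(data.data() + shdrs[i].sh_offset);
    size_t count = shdrs[i].sh_size / sizeof(Elf64_Rela);
    for (size_t r = 0; r &lt; count; ++r) {
      if (ELF64_R_TYPE(relas[r].r_info) != R_X86_64_TLSDESC) continue;
      const char* name = strtab + syms[ELF64_R_SYM(relas[r].r_info)].st_name;
      if (symbol == name) return relas[r].r_offset;
    }
  }
  return 0;
}
</code></pre>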
<h2>Determining the TLS base</h2>
<p>Now that we have the offset, how do we use that to actually read the data that the library puts there for us? This brings us back to the magic fs: portion of the mov instruction that we discussed earlier. In X86, most memory operands can optionally be supplied with a segment register that influences the address translation.</p>
<p>Segments are an archaic construct from the early days of 16-bit X86 where they were used to extend the address space. Essentially the architecture provides a range of segment registers that can be configured with different base addresses, thus allowing more than 16-bits worth of memory to be accessed. In times of 64-bit processors, this is hardly a concern anymore. In fact, X86-64 aka AMD64 got rid of all but two of those segment registers: fs and gs.</p>
<p>So why keep two of them? It turns out that they are quite useful for the use-case of thread-local data. Since every thread can be configured to have its own base address in these segment registers, we can use them to point to a block of data for this specific thread. That is precisely what libc implementations on Linux are doing with the fs segment. The offset that we snatched from the process's memory earlier is used as an address with the fs segment register, and the CPU automatically adds it to the per-thread base address.</p>
<p>To retrieve the base address pointed to by the fs segment register in the kernel, we need to read its destination from the kernel’s task_struct for the thread that we happened to interrupt with our profiling timer event. Getting the task struct is easy because we are blessed with the bpf_get_current_task BPF helper function. BPF helpers are pretty much syscalls for BPF programs: we can just ask the Linux kernel to hand us the pointer.</p>
<p>Armed with the task pointer, we now have to read the thread.fsbase (X86-64) or thread.uw.tp_value (aarch64) field to get our desired base address that the user-mode process accesses via fs. This is where things get complicated one last time, at least if we wish to support older kernels without <a href="https://www.kernel.org/doc/html/latest/bpf/btf.html">BTF support</a> (we do!). The <a href="https://github.com/torvalds/linux/blob/259f7d5e2baf87fcbb4fabc46526c9c47fed1914/include/linux/sched.h#L748">task_struct is huge</a> and there are hundreds of fields that can be present or not depending on how the kernel is configured. Being a core primitive of the scheduler, it is also constantly subject to changes between different kernel versions. On modern Linux distributions, the kernel is typically nice enough to tell us the offset via BTF. On older ones, the situation is more complicated. Since hardcoding the offset is clearly not an option if we want the code to be portable, we instead have to figure out the offset ourselves.</p>
<p>We do this by consulting /proc/kallsyms, a file with mappings between kernel functions and their addresses, and then using BPF to dump the compiled code of a kernel function that rarely changes and uses the desired offset. We dynamically disassemble and analyze the function and extract the offset directly from the assembly. For X86-64 specifically, we dump the <a href="https://elixir.bootlin.com/linux/v5.9.16/source/arch/x86/kernel/hw_breakpoint.c#L452">aout_dump_debugregs</a> function that accesses thread-&gt;ptrace_bps, which has consistently been 16 bytes away from the fsbase field that we are interested in for all kernels that we have ever looked at.</p>
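<p>The first half of that trick, resolving a kernel function's address from /proc/kallsyms, is simple enough to sketch (reading kallsyms typically requires elevated privileges, otherwise addresses show up as zero); the dumping and disassembly of the function body is where the real work happens:</p>
<pre><code class="language-cpp">#include &lt;cstdint&gt;
#include &lt;fstream&gt;
#include &lt;sstream&gt;
#include &lt;string&gt;

// Look up a kernel symbol (e.g. "aout_dump_debugregs") in /proc/kallsyms.
// Each line has the form: address type name [module]
uint64_t kallsymsLookup(const std::string&amp; wanted) {
  std::ifstream f("/proc/kallsyms");
  std::string line;
  while (std::getline(f, line)) {
    std::istringstream fields(line);
    std::string addr, type, name;
    fields &gt;&gt; addr &gt;&gt; type &gt;&gt; name;
    if (name == wanted) return std::stoull(addr, nullptr, 16);
  }
  return 0;  // not found
}
</code></pre>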
<h2>Reading TLS data from kernel</h2>
<p>With all the required offsets at our hands, we can now finally do what we set out to do in the first place: use them to enrich our stack traces with the OTel trace and span IDs that our C++ library prepared for us!</p>
<pre><code class="language-java">void maybe_add_otel_info(Trace* trace) {
  // Did user-mode insert a TLS offset for this process? Read it.
  TraceCorrProcInfo* proc = bpf_map_lookup_elem(&amp;tracecorr_procs, &amp;trace-&gt;pid);

  // No entry -&gt; process doesn't have the C++ library loaded.
  if (!proc) return;

  // Load the fsbase offset from our global configuration map.
  u32 key = 0;
  SystemConfig* syscfg = bpf_map_lookup_elem(&amp;system_config, &amp;key);

  // Read the fsbase offset from the kernel's task struct.
  u8* fsbase;
  u8* task = (u8*)bpf_get_current_task();
  bpf_probe_read_kernel(&amp;fsbase, sizeof(fsbase), task + syscfg-&gt;fsbase_offset);

  // Use the TLS offset to read the **pointer** to our TLS buffer.
  void* corr_buf_ptr;
  bpf_probe_read_user(
    &amp;corr_buf_ptr,
    sizeof(corr_buf_ptr),
    fsbase + proc-&gt;tls_offset
  );

  // Read the information that our library prepared for us.
  TraceCorrelationBuf corr_buf;
  bpf_probe_read_user(&amp;corr_buf, sizeof(corr_buf), corr_buf_ptr);

  // If the library reports that we are currently in a trace, store it into
  // the stack trace that will be reported to our user-land process.
  if (corr_buf.trace_present &amp;&amp; corr_buf.valid) {
    trace-&gt;otel_trace_id.as_int.hi = corr_buf.trace_id.as_int.hi;
    trace-&gt;otel_trace_id.as_int.lo = corr_buf.trace_id.as_int.lo;
    trace-&gt;otel_span_id.as_int = corr_buf.span_id.as_int;
  }
}
</code></pre>
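<p>For orientation, here is an illustrative sketch of what the per-thread buffer read by this BPF program might contain. This is not the authoritative layout; that is defined in the correlation protocol specification linked earlier. In the Java case, the buffer itself is a DirectByteBuffer owned by the OTel agent, and the C++ library merely publishes its address via elastic_tracecorr_tls_v1 (see setThreadProfilingCorrelationBuffer above) while the agent rewrites the contents whenever the active span changes.</p>
<pre><code class="language-cpp">#include &lt;cstdint&gt;

// Illustrative only -- field names chosen to mirror the BPF snippet above.
struct TraceCorrelationBuf {
  uint64_t valid;          // buffer is fully initialized and safe to read
  uint64_t trace_present;  // thread is currently executing inside a span
  uint8_t  trace_id[16];   // 128-bit OTel trace ID
  uint8_t  span_id[8];     // 64-bit OTel span ID
};
</code></pre>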
<h2>Sending out the mappings</h2>
<p>From this point on, everything else is pretty simple. The C++ library sets up a Unix datagram socket during startup and communicates the socket path to the profiler via the per-process data block. The stacktraces annotated with the OTel trace and span IDs are sent from BPF to our user-mode profiler process via perf event buffers, which in turn sends the mappings between OTel span/trace IDs and stack trace hashes to the C++ library. Our extensions to the OTel instrumentation framework then read those mappings and insert the stack trace hashes into the OTel trace.</p>
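<p>A minimal sketch of the library-side receive path, assuming a filesystem socket path and a fixed-size mapping record (both purely illustrative; the real message format is part of the correlation protocol):</p>
<pre><code class="language-cpp">#include &lt;sys/socket.h&gt;
#include &lt;sys/un.h&gt;
#include &lt;unistd.h&gt;
#include &lt;cstdint&gt;
#include &lt;cstring&gt;
#include &lt;string&gt;

int openMappingSocket(const std::string&amp; path) {
  int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  std::strncpy(addr.sun_path, path.c_str(), sizeof(addr.sun_path) - 1);
  unlink(path.c_str());
  bind(fd, reinterpret_cast&lt;sockaddr*&gt;(&amp;addr), sizeof(addr));
  return fd;  // the path is then advertised via the per-process data block
}

void receiveMappings(int fd) {
  // Hypothetical record: 16-byte trace ID + 8-byte span ID + 16-byte stack
  // trace hash, 40 bytes in total.
  uint8_t msg[40];
  while (recv(fd, msg, sizeof(msg), 0) &gt; 0) {
    // Hand the stack trace hash to the OTel instrumentation so it can be
    // attached to the corresponding span.
  }
}
</code></pre>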
<p>This approach has a few major upsides compared to the perhaps more obvious alternative of sending out the OTel span and trace ID with the profiler’s stacktrace records. We want the stacktrace associations to be stored in the trace indices to allow filtering and aggregating stacktraces by the plethora of fields available on OTel traces. If we were to send out the trace IDs via the profiler's gRPC connection instead, we’d have to search for and update the corresponding OTel trace records in the profiling collector to insert the stack trace hashes.</p>
<p>This is not trivial: stacktraces are sent out rather frequently (every 5 seconds, as of writing) and the corresponding OTel trace might not have been sent and stored by the time the corresponding stack traces arrive in our cluster. We’d have to build a kind of delay queue and periodically retry updating the OTel trace documents, introducing avoidable database work and complexity in the collectors. With the approach of sending stacktrace mappings to the OTel instrumented process instead, the need for server-side merging vanishes entirely.</p>
<h2>Trace correlation in action</h2>
<p>With all the hard work out of the way, let’s take a look at what trace correlation looks like in action!</p>
<p><em>[Video: trace correlation in action (Vidyard ID JYTzQYeiJ6CK6K3hZ33sz5)]</em></p>
<h2>Future work: Supporting other languages</h2>
<p>We have demonstrated that trace correlation can work nicely for Java, but we have no intention of stopping there. The general approach that we discussed previously should work for any language that can efficiently load and call into our C++ library and doesn’t do user-mode scheduling with coroutines. The problem with user-mode scheduling is that the logical thread can change at any await/yield point, requiring us to update the trace IDs in TLS. Many such coroutine environments like Rust’s Tokio provide the ability to register a callback for whenever the active task is swapped, so they can be supported easily. Other languages, however, do not provide that option.</p>
<p>One prominent example in that category is Go: goroutines are built on user-mode scheduling, but to our knowledge there’s no way to instrument the scheduler. Such languages will need solutions that don’t go via the generic TLS path. For Go specifically, we have already built a prototype that uses pprof labels that are associated with a specific Goroutine, having Go’s scheduler update them for us automatically.</p>
<h2>Getting started</h2>
<p>We hope this blog post has given you an overview of correlating profiling signals to distributed tracing, and its benefits for end-users.</p>
<p>To get started, download the <a href="https://github.com/elastic/elastic-otel-java">Elastic distribution of the OTel agent</a>, which contains the new trace correlation library. Additionally, you will need the latest version of the Universal Profiling agent, bundled with <a href="https://www.elastic.co/blog/whats-new-elastic-8-13-0">Elastic Stack version 8.13</a>.</p>
<h2>Acknowledgment</h2>
<p>We are grateful to <a href="https://github.com/trask">Trask Stalnaker</a>, maintainer of the OTel Java agent, for his feedback on our approach and for reviewing an early draft of this blog post.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-distributed-tracing-correlation/Under_highway_bridge.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[OpenTelemetry and Elastic: Working together to establish continuous profiling for the community]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry</link>
            <guid isPermaLink="false">elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry</guid>
            <pubDate>Tue, 12 Mar 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[OpenTelemetry is embracing profiling. Elastic is donating its whole-system continuous profiling agent to OpenTelemetry to further this advancement, empowering OTel users to improve computational efficiency and reduce their carbon footprint.]]></description>
            <content:encoded><![CDATA[<p>Profiling is emerging as a core pillar of observability, aptly dubbed the fourth pillar, with the OpenTelemetry (OTel) project leading this essential development. This blog post dives into the recent advancements in profiling within OTel and how Elastic® is actively contributing toward it.</p>
<p>At Elastic, we’re big believers in and contributors to the OpenTelemetry project. The project’s benefits of flexibility, performance, and vendor agnosticism have been making their rounds; we’ve seen a groundswell of customer interest.</p>
<p>To this end, after donating our <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-faq"><strong>Elastic Common Schema</strong></a> and our <a href="https://www.elastic.co/blog/elastic-invokedynamic-opentelemetry-java-agent">invokedynamic-based Java agent approach</a>, we recently <a href="https://github.com/open-telemetry/community/issues/1918">announced our intent to donate our continuous profiling agent</a> — a whole-system, always-on, continuous profiling solution that eliminates the need for run-time/bytecode instrumentation, recompilation, on-host debug symbols, or service restarts.</p>
<p>Profiling helps organizations run efficient services by minimizing computational wastage, thereby reducing operational costs. Leveraging <a href="https://ebpf.io/">eBPF</a>, the Elastic profiling agent provides unprecedented visibility into the runtime behavior of all applications: it builds stacktraces that go from the kernel, through userspace native code, all the way into code running in higher level runtimes, enabling you to identify performance regressions, reduce wasteful computations, and debug complex issues faster.</p>
<h2>Enabling profiling in OpenTelemetry: A step toward unified observability</h2>
<p>Elastic actively participates in the OTel community, particularly within the Profiling Special Interest Group (SIG). This group has been instrumental in defining the OTel <a href="https://github.com/open-telemetry/oteps/blob/main/text/profiles/0239-profiles-data-model.md">Profiling Data Model</a>, a crucial step toward standardizing profiling data.</p>
<p>The recent merger of the <a href="https://github.com/open-telemetry/oteps/pull/239">OpenTelemetry Enhancement Proposal (OTEP) introducing profiling support to the OpenTelemetry Protocol (OTLP)</a> marks a significant milestone. With the standardization of profiles as a core observability pillar alongside metrics, tracing, and logs, OTel offers a comprehensive suite of observability tools, empowering users to gain a holistic view of their applications' health and performance.</p>
<p>In line with this advancement, we are donating our whole-system, eBPF-based continuous profiling agent to OTel. In parallel, we are implementing the experimental OTel Profiling signal in the profiling agent to ensure and demonstrate OTel protocol compatibility, preparing it for fully OTel-based collection of profiling signals that can be correlated with logs, metrics, and traces.</p>
<h2>Why is Elastic donating the eBPF-based profiling agent to OpenTelemetry?</h2>
<p>Computational efficiency has always been a critical concern for software professionals. However, in an era where every line of code affects both the bottom line and the environment, there's an additional reason to focus on it. Elastic is committed to helping the OpenTelemetry community enhance computational efficiency because efficient software not only reduces the cost of goods sold (COGS) but also reduces carbon footprint.</p>
<p>We have seen firsthand — both internally and from our customers' testimonials — how profiling insights aid in enhancing software efficiency. This results in an improved customer experience, lower resource consumption, and reduced cloud costs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry/1-flamegraph.png" alt="A differential flamegraph showing regression in release comparison" /></p>
<p>Moreover, adopting a whole-system profiling strategy, such as <a href="https://www.elastic.co/blog/whole-system-visibility-elastic-universal-profiling">Elastic Universal Profiling</a>, differs significantly from traditional instrumentation profilers that focus solely on runtime. Elastic Universal Profiling provides whole-system visibility, profiling not only your own code but also third-party libraries, kernel operations, and other code you don't own. This comprehensive approach facilitates rapid optimizations by identifying non-optimal common libraries and uncovering &quot;unknown unknowns&quot; that consume CPU cycles. Often, a tipping point is reached when the resource consumption of libraries or certain daemon processes exceeds that of the applications themselves. Without system-wide profiling, along with the capabilities to slice data per service and aggregate total usage, pinpointing these resource-intensive components becomes a formidable challenge.</p>
<p>At Elastic, we have a customer with an extensive cloud footprint who plans to negotiate with their cloud provider to reclaim money for the significant compute resources consumed by the cloud provider's in-VM agents. These examples highlight the importance of whole-system profiling and the benefits that the OpenTelemetry community will gain if the donation proposal is accepted.</p>
<p>Specifically, OTel users will gain access to a lightweight, battle-tested production-grade continuous profiling agent with the following features:</p>
<ul>
<li>
<p>Very low CPU and memory overhead (1% CPU and 250MB memory are our upper limits in testing, and the agent typically manages to stay way below that)</p>
</li>
<li>
<p>Support for native C/C++ executables without the need for DWARF debug information by leveraging .eh_frame data, as described in “<a href="https://www.elastic.co/blog/universal-profiling-frame-pointers-symbols-ebpf">How Universal Profiling unwinds stacks without frame pointers and symbols</a>”</p>
</li>
<li>
<p>Support for profiling system libraries without frame pointers and without debug symbols on the host</p>
</li>
<li>
<p>Support for mixed stacktraces between runtimes — stacktraces go from kernel space through unmodified system libraries all the way into high-level languages</p>
</li>
<li>
<p>Support for native code (C/C++, Rust, Zig, Go, etc. without debug symbols on host)</p>
</li>
<li>
<p>Support for a broad set of high-level languages (HotSpot JVM, Python, Ruby, PHP, Node.js, V8, Perl), with .NET support in preparation</p>
</li>
<li>
<p><strong>100% non-intrusive:</strong> there's no need to load agents or libraries into the processes that are being profiled</p>
</li>
<li>
<p>No need for any reconfiguration, instrumentation, or restarts of HLL interpreters and VMs: the agent supports unwinding each of the supported languages in the default configuration</p>
</li>
<li>
<p>Support for x86 and Arm64 CPU architectures</p>
</li>
<li>
<p>Support for native inline frames, which provide insights into compiler optimizations and offer a higher precision of function call chains</p>
</li>
<li>
<p>Support for <a href="https://www.elastic.co/guide/en/observability/current/profiling-probabilistic-profiling.html">Probabilistic Profiling</a> to reduce data storage costs</p>
</li>
<li>
<p>. . . and more</p>
</li>
</ul>
<p>Elastic's commitment to enhancing computational efficiency and our belief in the OpenTelemetry vision underscore our dedication to advancing the observability ecosystem. By donating the profiling agent, Elastic is not only contributing technology but also dedicating a team of specialized profiling domain experts to co-maintain and advance the profiling capabilities within OpenTelemetry.</p>
<h2>How does this donation benefit the OTel community?</h2>
<p>Metrics, logs, and traces offer invaluable insights into system health. But what if you could unlock an even deeper level of visibility? Here's why profiling is a perfect complement to your OTel toolkit:</p>
<h3>1. Deep system visibility: Beyond the surface</h3>
<p>Think of whole-system profiling as an MRI scan for your fleet. It goes deeper into the internals of your system, revealing hidden performance issues lurking beneath the surface. You can identify &quot;unknown unknowns&quot; — inefficiencies you wouldn't have noticed otherwise — and gain a comprehensive understanding of how your system functions at its core.</p>
<h3>2. Cross-signal correlation: Answering &quot;why&quot; with confidence</h3>
<p>The Elastic Universal Profiling agent supports trace correlation with the OTel Java agent/SDK (with Go support coming soon!). This correlation enables OTel users to view profiling data by services or service endpoints, allowing for a more context-aware and targeted root cause analysis. This powerful combination allows you to pinpoint the exact cause of resource consumption at the trace level. No more guessing why specific functions hog CPU or why certain events occur. You can finally answer the critical &quot;why&quot; questions with precision, enabling targeted optimization efforts.</p>
<h3>3. Cost and sustainability optimization: Beyond performance</h3>
<p>Our approach to profiling goes beyond just performance gains. By correlating whole-system profiling data with tracing, we can help you measure the environmental impact and cloud cost associated with specific services and functionalities within your application. This empowers you to make data-driven decisions that optimize both performance and resource utilization, leading to a more sustainable and cost-effective operation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry/2-universal-profiling.png" alt="A differential function insight, showing the performance, cost, and CO2 impact of a change" /></p>
<h2>Elastic's commitment to OpenTelemetry</h2>
<p>Elastic currently supports a growing list of Cloud Native Computing Foundation (CNCF) projects <a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">such as Kubernetes (K8S), Prometheus, Fluentd, Fluent Bit, and Istio</a>. <a href="https://www.elastic.co/observability/application-performance-monitoring">Elastic’s application performance monitoring (APM)</a> also natively supports OTel, ensuring all APM capabilities are available with either Elastic or OTel agents or a combination of the two. In addition to the ECS contribution and ongoing collaboration with OTel SemConv, Elastic <a href="https://www.elastic.co/observability/opentelemetry">has continued to make contributions to other OTel projects</a>, including language SDKs (such as OTel Swift, OTel Go, OTel Ruby, and others), and participates in several <a href="https://github.com/open-telemetry/community#special-interest-groups">special interest groups (SIGs)</a> to establish OTel as a standard for observability and security.</p>
<p>We are excited about our <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">strengthening relationship with OTel</a> and the opportunity to donate our profiling agent in a way that benefits both the Elastic community and the broader OTel community. Learn more about <a href="https://www.elastic.co/observability/opentelemetry">Elastic’s OpenTelemetry support</a>, or contribute to the <a href="https://github.com/open-telemetry/community/issues/1918">donation proposal</a> and join the conversation.</p>
<p>Stay tuned for further updates as the profiling part of OTel continues to evolve.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry/ecs-otel-announcement-1.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Elastic Universal Profiling agent, a continuous profiling solution, is now open source]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-universal-profiling-agent-open-source</link>
            <guid isPermaLink="false">elastic-universal-profiling-agent-open-source</guid>
            <pubDate>Mon, 15 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[At Elastic, open source isn't just philosophy, it's our DNA. Dive into the future with our open-sourced Universal Profiling agent, revolutionizing software efficiency and sustainability.]]></description>
            <content:encoded><![CDATA[<p>Elastic Universal Profiling™ agent is now open source! The industry’s most advanced fleetwide continuous profiling solution empowers users to identify performance bottlenecks, reduce cloud spend, and minimize their carbon footprint. This post explores the history of the agent, its move to open source, and its future integration with OpenTelemetry.</p>
<h2>Elastic Universal Profiling™ Agent goes open source under Apache 2</h2>
<p>At Elastic, open source is more than just a philosophy — it's our DNA. We believe the benefits of whole-system continuous profiling extend far beyond performance optimization. It's a win for businesses and the planet alike. For instance, since launching Elastic Universal Profiling in general availability (GA), we've observed a wide variety of use cases from customers.</p>
<p>These range from customers relying fully on Universal Profiling's <a href="https://www.elastic.co/guide/en/observability/current/universal-profiling.html#profiling-differential-views-intro">differential flame graphs and topN functions</a> for insights during release management to utilizing AI assistants for quickly optimizing expensive functions. This includes using profiling data to identify the optimal energy-efficient cloud region to run certain workloads. Additionally, customers are using insights that Universal Profiling provides to build evidence to challenge cloud provider bills. As it turns out, cloud providers' in-VM agents can consume a significant portion of the CPU time, which customers are billed for.</p>
<p>In a move that will empower the community to take advantage of continuous profiling's benefits, <strong>we're thrilled to announce that the Elastic Universal Profiling agent</strong>, a pioneering eBPF-based continuous profiling agent, <strong>is now open source under the Apache 2 license!</strong></p>
<p>This move democratizes <strong>hyper-scaler efficiency for everyone</strong>, opening exciting new possibilities for the future of continuous profiling, as well as its role in observability and <strong>OpenTelemetry</strong>.</p>
<h2>Implementation of the OpenTelemetry (OTel) Profiling protocol</h2>
<p>Our commitment to open source goes beyond just the agent itself. We recently <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">announced our intent to donate</a> the agent to OpenTelemetry and have further solidified this goal by implementing the experimental <a href="https://github.com/open-telemetry/oteps/blob/main/text/profiles/0239-profiles-data-model.md">OTel Profiling data model</a>. This allows the open-sourced eBPF-based continuous profiling agent to communicate seamlessly with OpenTelemetry backends.</p>
<p>But that's not all! We've also launched an innovative feature that <a href="https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation">correlates profiling data with OpenTelemetry distributed traces</a>. This powerful capability offers a deeper level of insight into application performance, enabling the identification of bottlenecks with greater precision. Upon donating the Profiling agent to OTel, Elastic will also contribute critical components that enable distributed trace correlation within the <a href="https://github.com/elastic/elastic-otel-java">Elastic distribution of the OTel Java agent</a> to the upstream OTel Java SDK. This underscores Elastic Observability's commitment to both open source and the support of open standards like OpenTelemetry while pushing the boundaries of what is possible in observability.</p>
<h2>What does this mean for Elastic Universal Profiling customers?</h2>
<p>We'd like to express our <strong>immense gratitude to all our customers</strong> who have been part of this journey, from the early stages of private beta to GA. Your feedback has been invaluable in shaping Universal Profiling into the powerful product it is today.</p>
<p>By open-sourcing the Universal Profiling agent and contributing it to OpenTelemetry, we're fostering a win-win situation for both you and the broader community. This move opens doors for innovation and collaboration, ultimately leading to a more robust and versatile whole-system continuous profiling solution for everyone.</p>
<p>Furthermore, we're actively working on exciting novel ways to integrate Universal Profiling seamlessly within Elastic Observability. Expect further announcements soon, outlining how you can unlock even greater value from your profiling data within a unified observability experience in a way that has never been done before.</p>
<p>The open-sourced agent is using the recently released (experimental) OTel Profiling <a href="https://github.com/open-telemetry/opentelemetry-proto/pull/534">signal</a>. As a precaution, we recommend not using it in production environments.</p>
<p>Please continue using the official Elastic distribution of the Universal Profiling agent until the agent is formally accepted by OTel and the protocol reaches a stable phase. There's no need to take any action at this time, and we will ensure to have a smooth transition plan in place for you.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-agent-open-source/image1.png" alt="1 - Elastic Universal Profiling" /></p>
<h2>What does this mean for the OpenTelemetry community?</h2>
<p>OpenTelemetry is adopting continuous profiling as a key signal. By open-sourcing the eBPF-based profiling agent and working towards donating it to OTel, Elastic is making it possible to accelerate the standardization of continuous profiling within OpenTelemetry. This move has a massive impact on the observability community, empowering everyone to continuously profile their systems with a standardized protocol.</p>
<p>This is particularly timely as <a href="https://www.bbc.co.uk/news/technology-32335003">Moore's Law</a> slows down and cloud computing takes hold, making computational efficiency critical for businesses.</p>
<p>Here's how whole-system continuous profiling benefits you:</p>
<ul>
<li>
<p><strong>Maximize gross margins:</strong> By reducing the computational resources needed to run applications, businesses can optimize their cloud spend and improve profitability. Whole-system continuous profiling is one way of identifying the most expensive applications (down to the lines of code) across diverse environments that may span multiple cloud providers. This principle aligns with the familiar adage, <em>&quot;a penny saved is a penny earned.&quot;</em> In the cloud context, every CPU cycle saved translates to money saved.</p>
</li>
<li>
<p><strong>Minimize environmental impact:</strong> Energy consumption associated with computing is a growing concern (source: <a href="https://energy.mit.edu/news/energy-efficient-computing/">MIT Energy Initiative</a>). More efficient code translates to lower energy consumption, contributing to a reduction in carbon footprint.</p>
</li>
<li>
<p><strong>Accelerate engineering workflows:</strong> Continuous profiling provides detailed insights to help debug complex issues faster, guide development, and improve overall code quality.</p>
</li>
</ul>
<p>This is where Elastic Universal Profiling comes in — designed to help organizations run efficient services by minimizing computational wastage. To this end, it measures code efficiency in three dimensions: <strong>CPU utilization</strong>, <strong>CO<sub>2</sub></strong>, and <strong>cloud cost</strong>.</p>
<p>Elastic's journey with continuous profiling began by joining forces with <a href="https://www.elastic.co/about/press/elastic-and-optimyze-join-forces-to-deliver-continuous-profiling-of-infrastructure-applications-and-services">optimyze.cloud</a> –– this became the foundation for <a href="https://www.elastic.co/observability/universal-profiling">Elastic Universal Profiling</a>. We are excited to see this product evolve into its next growth phase in the open-source world.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-agent-open-source/image2.png" alt="2 - car manufacturers" /></p>
<h2>Ready to give it a spin?</h2>
<p>As Elastic Universal Profiling transitions into this new open source era, the potential for transformative impact on performance optimization, cost efficiency, and environmental sustainability is immense. Elastic's approach — balancing innovation with responsibility — paves the way for a future where technology not only powers our world but does so in a way that is sustainable and accessible to all.</p>
<p>Get started with the open source Elastic Universal Profiling agent today! <a href="https://github.com/elastic/otel-profiling-agent/">Download it directly from GitHub</a> and follow the instructions in the repository.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-agent-open-source/image3.png" alt="3 - dripping graph and data" /></p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-agent-open-source/tree_tunnel.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Unlocking whole-system visibility with Elastic Universal Profiling™]]></title>
            <link>https://www.elastic.co/observability-labs/blog/whole-system-visibility-elastic-universal-profiling</link>
            <guid isPermaLink="false">whole-system-visibility-elastic-universal-profiling</guid>
            <pubDate>Mon, 25 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Visual profiling data can be overwhelming. This blog post aims to demystify continuous profiling and guide you through its unique visualizations. We will equip you with the knowledge to derive quick, actionable insights from Universal Profiling™.]]></description>
            <content:encoded><![CDATA[<h2>Identify, optimize, measure, repeat!</h2>
<p>SREs and developers who want to maintain robust, efficient systems and achieve optimal code performance need effective tools to measure and improve it. Profilers are invaluable for these tasks, as they can help you boost your app's throughput, ensure consistent system reliability, and gain a deeper understanding of your code's behavior at runtime. However, traditional profilers can be cumbersome to use, as they often require code recompilation and are limited to specific languages. They can also have a high overhead that negatively affects performance and makes them less suitable for quick, real-time debugging in production environments.</p>
<p>To address the limitations of traditional profilers, Elastic<sup>®</sup> recently <a href="https://www.elastic.co/blog/continuous-profiling-is-generally-available">announced the general availability of Elastic Universal Profiling</a>, a <a href="https://www.elastic.co/observability/universal-profiling">continuous profiling</a> product that is refreshingly straightforward to use, eliminating the need for instrumentation, recompilations, or restarts. Moreover, Elastic Universal Profiling does not require on-host debug symbols and is language-agnostic, allowing you to profile any process running on your machines — from your application's code to third-party libraries and even kernel functions.</p>
<p>However, even the most advanced tools require a certain level of expertise to interpret the data effectively. The wealth of visual profiling data — flamegraphs, stacktraces, or functions — can initially seem overwhelming. This blog post aims to demystify <a href="https://www.elastic.co/observability/universal-profiling">continuous profiling</a> and guide you through its unique visualizations. We will equip you with the knowledge to derive quick, actionable insights from Universal Profiling.</p>
<p>Let’s begin.</p>
<h2>Stacktraces: The cornerstone for profiling</h2>
<h3>It all begins with a stacktrace — a snapshot capturing the cascade of function calls.</h3>
<p>A stacktrace is a snapshot of the call stack of an application at a specific point in time. It captures the sequence of function calls that the program has made up to that point. In this way, a stacktrace serves as a historical record of the call stack, allowing you to trace back the steps that led to a particular state in your application.</p>
<p>Further, stacktraces are the foundational data structure that profilers rely on to determine what an application is executing at any given moment. This is particularly useful when, for instance, your infrastructure monitoring indicates that your application servers are consuming 95% of CPU resources. While utilities such as 'top -H' can show the top processes that are consuming CPU, they lack the granularity needed to identify the specific lines of code (in the top process) responsible for the high usage.</p>
<p>In the case of Elastic Universal Profiling, <a href="https://www.elastic.co/blog/ebpf-observability-security-workload-profiling">eBPF is used</a> to perform sampling of every process that is keeping a CPU core busy. Unlike most instrumentation profilers that focus solely on your application code, Elastic Universal Profiling provides whole-system visibility — it profiles not just your code, but also code you don't own, including third-party libraries and even kernel operations.</p>
<p>The diagram below shows how the Universal Profiling agent works at a very high level. Step 5 indicates the ingestion of the stacktraces into the profiling collector, a new part of the Elastic Stack.</p>
<p><em><strong>Just <a href="https://www.elastic.co/guide/en/observability/current/profiling-get-started.html">deploy the profiling host agent</a> and receive profiling data (in Kibana<sup>®</sup>) a few minutes later. <a href="https://www.elastic.co/guide/en/observability/current/profiling-get-started.html">Get started now</a>.</strong></em></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-1-flowchart-linux.png" alt="High-level depiction of how the profiling agent works" /></p>
<ol>
<li>
<p>Unwinder eBPF programs (bytecode) are sent to the kernel.</p>
</li>
<li>
<p>The kernel verifies that the BPF program is safe. If accepted, the program is attached to the probes and executed when the event occurs.</p>
</li>
<li>
<p>The eBPF programs pass the collected data to userspace via maps.</p>
</li>
<li>
<p>The agent reads the collected data from maps. The data transferred from the agent to the maps is process-specific and interpreter-specific metadata that helps the eBPF unwinder programs perform unwinding.</p>
</li>
<li>
<p>Stacktraces, metrics, and metadata are pushed to the Elastic Stack.</p>
</li>
<li>
<p>Visualize data as flamegraphs, stacktraces, and functions via Kibana.</p>
</li>
</ol>
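<p>The unwinders Elastic ships are specific to its agent, but the general pattern behind steps 1 to 5 can be illustrated with a minimal, hypothetical sketch built on the open-source BCC toolkit: an eBPF program fires on a timed perf event, records stack IDs and counts into maps, and a userspace loop reads and symbolizes them. This is an illustration of the technique only, not Elastic's agent code, and the 49 Hz sampling frequency is an arbitrary choice.</p>
<pre><code># Minimal CPU-sampling sketch using the open-source BCC toolkit.
# Illustration of the general eBPF sampling pattern only -- this is
# NOT the Elastic Universal Profiling agent or its unwinders.
import time
from bcc import BPF, PerfType, PerfSWConfig

bpf_text = r"""
#include &lt;uapi/linux/ptrace.h&gt;
#include &lt;linux/sched.h&gt;

struct key_t {
    u32 pid;
    int user_stack_id;
    int kernel_stack_id;
    char comm[TASK_COMM_LEN];
};

BPF_HASH(counts, struct key_t, u64);   // stack key -&gt; sample count
BPF_STACK_TRACE(stack_traces, 16384);  // stack id  -&gt; frame addresses

// Runs in the kernel each time the perf sampling event fires (steps 1-2).
int do_sample(struct bpf_perf_event_data *ctx) {
    struct key_t key = {};
    key.pid = bpf_get_current_pid_tgid() &gt;&gt; 32;
    bpf_get_current_comm(&amp;key.comm, sizeof(key.comm));
    key.user_stack_id = stack_traces.get_stackid(&amp;ctx-&gt;regs, BPF_F_USER_STACK);
    key.kernel_stack_id = stack_traces.get_stackid(&amp;ctx-&gt;regs, 0);
    counts.increment(key);             // handed to userspace via maps (step 3)
    return 0;
}
"""

b = BPF(text=bpf_text)  # the kernel verifies and loads the bytecode
b.attach_perf_event(ev_type=PerfType.SOFTWARE, ev_config=PerfSWConfig.CPU_CLOCK,
                    fn_name="do_sample", sample_freq=49)  # 49 samples/sec per CPU

time.sleep(10)  # let samples accumulate

# Userspace reads the maps (step 4) and symbolizes the raw frame addresses.
stacks = b["stack_traces"]
for key, count in b["counts"].items():
    frames = []
    if key.kernel_stack_id &gt;= 0:
        frames += [b.ksym(a).decode() for a in stacks.walk(key.kernel_stack_id)]
    if key.user_stack_id &gt;= 0:
        frames += [b.sym(a, key.pid).decode() for a in stacks.walk(key.user_stack_id)]
    print(count.value, key.comm.decode(), " ; ".join(frames))
</code></pre>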
<p>While stacktraces are the key ingredient for most profiling tools, interpreting them can be tricky. Let's take a look at a simple example to make things a bit easier. The table below shows a group of stacktraces from a Java application and assigns each a percentage to indicate its share of CPU time consumption.</p>
<p><strong>Table 1: Grouped Stacktraces with CPU Time Percentage</strong></p>
<table>
<thead>
<tr>
<th>Percentage</th>
<th>Function Calls</th>
</tr>
</thead>
<tbody>
<tr>
<td>60%</td>
<td>startApp -&gt; authenticateUser -&gt; processTransaction</td>
</tr>
<tr>
<td>20%</td>
<td>startApp -&gt; loadAccountDetails -&gt; fetchRecentTransactions</td>
</tr>
<tr>
<td>10%</td>
<td>startApp -&gt; authenticateUser -&gt; processTransaction -&gt; verifyFunds</td>
</tr>
<tr>
<td>2%</td>
<td>startApp -&gt; authenticateUser -&gt; processTransaction -&gt; libjvm.so</td>
</tr>
<tr>
<td>1%</td>
<td>startApp -&gt; authenticateUser -&gt; processTransaction -&gt; libjvm.so -&gt; vmlinux: asm_common_interrupt -&gt; vmlinux: asm_sysvec_apic_timer_interrupt</td>
</tr>
</tbody>
</table>
<p>The percentages above represent the relative frequency of each specific stacktrace compared to the total number of stacktraces collected over the observation period, not actual CPU usage percentages. Also, the libjvm.so and kernel frames (vmlinux:*) in the example are commonly observed with whole-system profilers like Elastic Universal Profiling.</p>
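<p>To make that arithmetic concrete, the following hypothetical Python sketch groups raw stacktrace samples and computes their relative frequencies in the style of Table 1. The function names come from the example above, the sample counts are invented, and a fourth, made-up stack pads the total to 100 samples.</p>
<pre><code># Hypothetical sketch: grouping sampled stacktraces and computing the
# relative frequency of each group, as in Table 1. Counts are invented.
from collections import Counter

# Each sample is one captured call stack, ordered root frame to leaf frame.
samples = (
    [("startApp", "authenticateUser", "processTransaction")] * 60
    + [("startApp", "loadAccountDetails", "fetchRecentTransactions")] * 20
    + [("startApp", "authenticateUser", "processTransaction", "verifyFunds")] * 10
    + [("startApp", "renderDashboard")] * 10  # invented filler to reach 100 samples
)

grouped = Counter(samples)
total = len(samples)

for stack, count in grouped.most_common():
    share = 100 * count / total
    print(f"{share:5.1f}%  {' -&gt; '.join(stack)}")
</code></pre>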
<p>From the table, we can see that <strong>60%</strong> of the samples land in the sequence startApp -&gt; authenticateUser -&gt; processTransaction. An additional <strong>10%</strong> of the processing time is spent in verifyFunds, a function invoked by processTransaction. Given these observations, optimization efforts would yield the most impact if centered on the processTransaction function, as it is one of the most expensive functions in the application. However, real-world stacktraces can be far more intricate than this example, so how do we make sense of them quickly? That problem is exactly what flamegraphs were created to solve.</p>
<h2>Flamegraphs: A visualization of stacktraces</h2>
<p>While the above example may appear straightforward, it scarcely reflects the complexities encountered when aggregating multiple stacktraces across a fleet of machines on a continuous basis. The depth of the stack traces and the numerous branching paths can make it increasingly difficult to pinpoint where code is consuming resources. This is where flamegraphs, a concept popularized by <a href="https://www.brendangregg.com/flamegraphs.html">Brendan Gregg</a>, come into play.</p>
<p>A flamegraph is a visual interpretation of stacktraces, designed to quickly and accurately identify the functions that are consuming the most resources. Each function is represented by a rectangle, where the width of the rectangle represents the amount of time spent in the function, and the number of stacked rectangles represents the stack depth. The stack depth is the number of functions that were called to reach the current function.</p>
<p>Elastic Universal Profiling uses icicle graphs, an inverted variant of the standard flamegraph. In an icicle graph, the root function is at the top, and its child functions are shown below their parents –– making it easier to see the hierarchy of functions and how they relate to each other.</p>
<p>In most flamegraphs, the y-axis represents stack depth, but there is no standardization for the x-axis. Some profiling tools use the x-axis to indicate the passage of time; in these instances, the graph is more accurately termed a flame chart. Others sort the x-axis alphabetically. Universal Profiling sorts functions on the x-axis based on relative CPU percentage utilization, starting with the function that consumes the most CPU time on the left, as shown in the example icicle graph below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-2-cpu-time.png" alt="Example icicle graph: The percentage represents relative CPU time, not the real CPU usage time. " /></p>
<h2>Debugging and optimizing performance issues: Stacktraces, TopN functions, flamegraphs</h2>
<p>SREs and SWEs can use Universal Profiling for troubleshooting, debugging, and performance optimization. It builds stacktraces that go from the kernel, through userspace native code, all the way into code running in higher-level runtimes, enabling you to <strong>identify performance regressions</strong>, <strong>reduce wasteful computations</strong>, and <strong>debug complex issues faster</strong>.</p>
<p>To this end, Universal Profiling offers three main visualizations: Stacktraces, TopN Functions, and flamegraphs.</p>
<h3>Stacktrace view</h3>
<p>The stacktraces view shows grouped stacktrace graphs by threads, hosts, Kubernetes deployments, and containers. It can be used to detect unexpected CPU spikes across threads and drill down into a smaller time range to investigate further with a flamegraph. Refer to the <a href="https://www.elastic.co/guide/en/observability/current/universal-profiling.html#profiling-stacktraces-intro">documentation</a> for details.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-3-wave-patterns.png" alt="Notice the wave pattern in the stacktrace view, enabling you to drill down into a CPU spike " /></p>
<h3>TopN functions view</h3>
<p>Universal Profiling's TopN functions view shows the most frequently sampled functions, broken down by CPU time, annualized CO<sub>2</sub>, and annualized cost estimates. You can use this view to identify the most expensive functions across your entire fleet, and then apply filters to focus on specific components for a more detailed analysis. Clicking on a function name will redirect you to the flamegraph, enabling you to examine the call hierarchy.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-4-topN-functions-page.png" alt="TopN functions page" /></p>
<h3>Flamegraphs view</h3>
<p>The flamegraph page is where you will likely spend most of your time, especially when debugging and optimizing. We recommend using the guide below to identify performance bottlenecks and optimization opportunities with flamegraphs. The three key signals to look for are <strong>width</strong>, <strong>hierarchy</strong>, and <strong>height</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-5-icivle-flamegraph.png" alt="Icicle flamegraph: We use the colors to determine different types of code (e.g., native, interpreted, kernel)." /></p>
<p><strong>Width matters:</strong> In icicle graphs, wider rectangles signify functions taking up more CPU time. Always read the graph from left to right and note the widest rectangles, as these are the prime hot spots.</p>
<p><strong>Hierarchy matters:</strong> Navigate the graph's stack to understand function relationships. This vertical examination will help you identify whether one or multiple functions are responsible for performance bottlenecks. This could also uncover opportunities for code improvements, such as swapping an inefficient library or avoiding unnecessary I/O operations.</p>
<p><strong>Height matters:</strong> Elevated or tall stacks in the graph usually point to deep call hierarchies. These can be an indicator of complex and less efficient code structures that may require attention.</p>
<p>When navigating a flamegraph, you may also want to search for specific function names to confirm whether they appear. The Universal Profiling flamegraph view includes a Search bar at the bottom-left corner of the view: enter a regex and the matches are highlighted in the flamegraph. The left and right arrows next to the Search bar let you step through the occurrences and inspect the callers and callees of the matched function.</p>
<p>In summary,</p>
<ul>
<li><strong>Scan</strong> horizontally from left to right, focusing on width for CPU-intensive functions.</li>
<li><strong>Examine</strong> the stack vertically to understand function relationships and spot bottlenecks.</li>
<li><strong>Look</strong> for <strong>towering stacks</strong> to identify potential complexities in the code.</li>
</ul>
<p>To recap, use topN functions to generate optimization hypotheses and validate them with stacktraces and/or flamegraphs. Use stacktraces to monitor CPU utilization trends and to delve into the finer details. Use flamegraphs to quickly debug and optimize your code, using width, hierarchy, and height as guides.</p>
<p>_ <strong>Identify. Optimize. Measure. Repeat!</strong> _</p>
<h2>Measure the impact of your change</h2>
<h3>For the first time, developers can measure the performance (gained or lost), cloud cost, and carbon footprint impact of every deployed change.</h3>
<p>Once you have identified a performance issue and applied fixes or optimizations to your code, it is essential to measure the impact of your changes. The differential topN functions and differential flamegraph pages are invaluable for this, as they can help you identify regressions and measure your change impact not only in terms of performance but also in terms of carbon emissions and cost savings.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-6-uni-profiling.png" alt="A differential function view, showing the performance, CO2, and cost impact of a change" /></p>
<p>The Diff column indicates a change in the function’s rank.</p>
<p>You may need to use tags or other metadata, such as container and deployment name, in combination with time ranges to differentiate between the optimized and non-optimized changes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-7-differential-flamegraph.png" alt="A differential flamegraph showing regression in A/B testing" /></p>
<h2>Universal Profiling: The key to optimizing application resources</h2>
<p>Computational efficiency is no longer just a nice-to-have, but a must-have from both a financial and environmental sustainability perspective. Elastic Universal Profiling provides unprecedented visibility into the runtime behavior of all your applications, so you can identify and optimize the most resource-intensive areas of your code. The result is not merely better-performing software but also reduced resource consumption, lower cloud costs, and a reduction in carbon footprint. Optimizing your code with Universal Profiling is not only the right thing to do for your business, it’s the right thing to do for our world.</p>
<p><a href="https://www.elastic.co/guide/en/observability/current/profiling-get-started.html">Get started</a> with Elastic Universal Profiling today.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/universal-profiling-blog-720x420.jpg" length="0" type="image/jpg"/>
        </item>
    </channel>
</rss>