OpenTelemetry and Elastic: Working together to establish continuous profiling for the community


Profiling is emerging as a core pillar of observability, aptly dubbed the fourth pillar, with the OpenTelemetry (OTel) project leading this essential development. This blog post dives into the recent advancements in profiling within OTel and how Elastic® is actively contributing toward it. 

At Elastic, we’re big believers in and contributors to the OpenTelemetry project. The project’s benefits of flexibility, performance, and vendor agnosticism have been making their rounds; we’ve seen a groundswell of customer interest.

To this end, after donating our Elastic Common Schema and our invokedynamic-based Java agent approach, we recently announced our intent to donate our continuous profiling agent: a whole-system, always-on, continuous profiling solution that eliminates the need for run-time/bytecode instrumentation, recompilation, on-host debug symbols, or service restarts.

Profiling helps organizations run efficient services by minimizing computational wastage, thereby reducing operational costs. Leveraging eBPF, the Elastic profiling agent provides unprecedented visibility into the runtime behavior of all applications: it builds stacktraces that go from the kernel, through userspace native code, all the way into code running in higher level runtimes, enabling you to identify performance regressions, reduce wasteful computations, and debug complex issues faster.  

Enabling profiling in OpenTelemetry: A step toward unified observability

Elastic actively participates in the OTel community, particularly within the Profiling Special Interest Group (SIG). This group has been instrumental in defining the OTel Profiling Data Model, a crucial step toward standardizing profiling data.

The recent merge of the OpenTelemetry Enhancement Proposal (OTEP) that introduces profiling support to the OpenTelemetry Protocol (OTLP) marks a significant milestone. With the standardization of profiles as a core observability pillar alongside metrics, tracing, and logs, OTel offers a comprehensive suite of observability tools, empowering users to gain a holistic view of their applications' health and performance.

In line with this advancement, we are donating our whole-system, eBPF-based continuous profiling agent to OTel. In parallel, we are implementing the experimental OTel Profiling signal in the agent. This ensures and demonstrates the agent's compatibility with the OTel protocol and prepares it for fully OTel-based collection of profiling signals that can be correlated with logs, metrics, and traces.

Why is Elastic donating the eBPF-based profiling agent to OpenTelemetry?

Computational efficiency has always been a critical concern for software professionals. However, in an era where every line of code affects both the bottom line and the environment, there's an additional reason to focus on it. Elastic is committed to helping the OpenTelemetry community enhance computational efficiency because efficient software not only reduces the cost of goods sold (COGS) but also reduces carbon footprint. 

We have seen firsthand — both internally and from our customers' testimonials — how profiling insights aid in enhancing software efficiency. This results in an improved customer experience, lower resource consumption, and reduced cloud costs.

A differential flamegraph showing regression in release comparison

Moreover, adopting a whole-system profiling strategy, such as Elastic Universal Profiling, differs significantly from traditional instrumentation profilers that focus solely on runtime. Elastic Universal Profiling provides whole-system visibility, profiling not only your own code but also third-party libraries, kernel operations, and other code you don't own. This comprehensive approach facilitates rapid optimizations by identifying non-optimal common libraries and uncovering "unknown unknowns" that consume CPU cycles. Often, a tipping point is reached when the resource consumption of libraries or certain daemon processes exceeds that of the applications themselves. Without system-wide profiling, along with the capabilities to slice data per service and aggregate total usage, pinpointing these resource-intensive components becomes a formidable challenge.
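To make the idea of slicing per service and aggregating total usage concrete, here is a minimal sketch in Python. The record layout (a service name plus the origin of the sampled code) and the sample values are hypothetical and chosen only for illustration; they do not reflect the agent's actual data model or output.

```python
from collections import Counter

# Hypothetical CPU samples: (service, origin of the sampled code).
# Both the field layout and the values are illustrative assumptions.
SAMPLES = [
    ("checkout", "app"),
    ("checkout", "libz.so"),
    ("search", "libz.so"),
    ("search", "app"),
    ("search", "libz.so"),
    ("node-agent", "kernel"),
]


def share_by_origin(samples):
    """Aggregate across all services: fraction of samples per code origin."""
    counts = Counter(origin for _, origin in samples)
    total = sum(counts.values())
    return {origin: n / total for origin, n in counts.items()}


def share_by_service(samples):
    """Slice per service: fraction of all samples attributed to each service."""
    counts = Counter(service for service, _ in samples)
    total = sum(counts.values())
    return {service: n / total for service, n in counts.items()}


if __name__ == "__main__":
    for origin, share in sorted(share_by_origin(SAMPLES).items()):
        print(f"{origin}: {share:.0%}")
```

In this made-up data set, the shared library accounts for more samples fleet-wide than the application code of any single service, which is the kind of pattern that only becomes visible when samples from every process are aggregated together.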

At Elastic, we have a customer with an extensive cloud footprint who plans to negotiate with their cloud provider to reclaim money for the significant compute resources consumed by the cloud provider's in-VM agents. These examples highlight the importance of whole-system profiling and the benefits that the OpenTelemetry community will gain if the donation proposal is accepted.

Specifically, OTel users will gain access to a lightweight, battle-tested, production-grade continuous profiling agent with the following features:

  • Very low CPU and memory overhead (1% CPU and 250MB memory are our upper limits in testing, and the agent typically manages to stay way below that)

  • Support for native C/C++ executables without the need for DWARF debug information by leveraging .eh_frame data, as described in “How Universal Profiling unwinds stacks without frame pointers and symbols”

  • Support for profiling system libraries without frame pointers and without debug symbols on the host

  • Support for mixed stacktraces between runtimes: stacktraces go from kernel space through unmodified system libraries all the way into high-level languages

  • Support for native code (C/C++, Rust, Zig, Go, etc.) without debug symbols on the host

  • Support for a broad set of high-level languages (HotSpot JVM, Python, Ruby, PHP, Node.js, V8, Perl); .NET support is in preparation

  • 100% non-intrusive: there's no need to load agents or libraries into the processes that are being profiled

  • No need for any reconfiguration, instrumentation, or restarts of HLL interpreters and VMs: the agent supports unwinding each of the supported languages in the default configuration

  • Support for x86 and Arm64 CPU architectures  

  • Support for native inline frames, which provide insights into compiler optimizations and offer a higher precision of function call chains

  • Support for Probabilistic Profiling to reduce data storage costs

  • . . . and more

Elastic's commitment to enhancing computational efficiency and our belief in the OpenTelemetry vision underscore our dedication to advancing the observability ecosystem. By donating the profiling agent, Elastic is not only contributing technology but also dedicating a team of specialized profiling domain experts to co-maintain and advance the profiling capabilities within OpenTelemetry.

How does this donation benefit the OTel community?

Metrics, logs, and traces offer invaluable insights into system health. But what if you could unlock an even deeper level of visibility? Here's why profiling is a perfect complement to your OTel toolkit:

1. Deep system visibility: Beyond the surface

Think of whole-system profiling as an MRI scan for your fleet. It goes deeper into the internals of your system, revealing hidden performance issues lurking beneath the surface. You can identify "unknown unknowns" — inefficiencies you wouldn't have noticed otherwise — and gain a comprehensive understanding of how your system functions at its core.

2. Cross-signal correlation: Answering "why" with confidence

The Elastic Universal Profiling agent supports trace correlation with the OTel Java agent/SDK (with Go support coming soon!). This correlation enables OTel users to view profiling data by services or service endpoints, allowing for a more context-aware and targeted root cause analysis. This powerful combination allows you to pinpoint the exact cause of resource consumption at the trace level. No more guessing why specific functions hog CPU or why certain events occur. You can finally answer the critical "why" questions with precision, enabling targeted optimization efforts.
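As a rough illustration of what trace correlation makes possible, the Python sketch below joins profiling samples to spans through a shared trace ID and then groups CPU samples by service endpoint. The record shapes, identifiers, and function names are all hypothetical; this is a simplified sketch, not the agent's or the OTel SDK's actual API or wire format.

```python
from collections import defaultdict

# Hypothetical span index: trace ID -> (service, endpoint).
SPANS = {
    "t1": ("checkout", "POST /pay"),
    "t2": ("checkout", "GET /cart"),
}

# Hypothetical profiling samples: (trace ID or None, function, sample count).
# A trace ID of None stands for a sample taken outside any traced request.
SAMPLES = [
    ("t1", "json_encode", 4),
    ("t1", "tls_write", 1),
    ("t2", "json_encode", 2),
    (None, "gc_sweep", 3),
]

UNATTRIBUTED = ("unattributed", "unattributed")


def samples_by_endpoint(spans, samples):
    """Group sample counts per function under each (service, endpoint)."""
    grouped = defaultdict(lambda: defaultdict(int))
    for trace_id, function, count in samples:
        key = spans.get(trace_id, UNATTRIBUTED)
        grouped[key][function] += count
    return {key: dict(functions) for key, functions in grouped.items()}


if __name__ == "__main__":
    for key, functions in samples_by_endpoint(SPANS, SAMPLES).items():
        print(key, functions)
```

Samples that carry no known trace ID are kept in an explicit "unattributed" bucket rather than dropped, so the per-endpoint view still adds up to the whole-system total.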

3. Cost and sustainability optimization: Beyond performance

Our approach to profiling goes beyond just performance gains. By correlating whole-system profiling data with tracing, we can help you measure the environmental impact and cloud cost associated with specific services and functionalities within your application. This empowers you to make data-driven decisions that optimize both performance and resource utilization, leading to a more sustainable and cost-effective operation.

A differential function insight, showing the performance, cost, and CO2 impact of a change

Elastic's commitment to OpenTelemetry

Elastic currently supports a growing list of Cloud Native Computing Foundation (CNCF) projects such as Kubernetes (K8S), Prometheus, Fluentd, Fluent Bit, and Istio. Elastic’s application performance monitoring (APM) also natively supports OTel, ensuring all APM capabilities are available with either Elastic or OTel agents or a combination of the two. In addition to the ECS contribution and ongoing collaboration with OTel SemConv, Elastic has continued to make contributions to other OTel projects, including language SDKs (such as OTel Swift, OTel Go, OTel Ruby, and others), and participates in several special interest groups (SIGs) to establish OTel as a standard for observability and security.

We are excited about our strengthening relationship with OTel and the opportunity to donate our profiling agent in a way that benefits both the Elastic community and the broader OTel community. Learn more about Elastic’s OpenTelemetry support, contribute to the donation proposal, or just join the conversation.

Stay tuned for further updates as the profiling part of OTel continues to evolve.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.