<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Elastic Observability Labs - Articles by Alexander Wert</title>
        <link>https://www.elastic.co/observability-labs</link>
        <description>Trusted observability news &amp; research from the team at Elastic.</description>
        <lastBuildDate>Fri, 06 Mar 2026 16:24:31 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>Elastic Observability Labs - Articles by Alexander Wert</title>
            <url>https://www.elastic.co/observability-labs/assets/observability-labs-thumbnail.png</url>
            <link>https://www.elastic.co/observability-labs</link>
        </image>
        <copyright>© 2026. Elasticsearch B.V. All Rights Reserved</copyright>
        <item>
            <title><![CDATA[Collecting JMX metrics with OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/collecting-jmx-metrics-opentelemetry</link>
            <guid isPermaLink="false">collecting-jmx-metrics-opentelemetry</guid>
            <pubDate>Thu, 05 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to collect Tomcat JMX metrics with OpenTelemetry using the Java agent or jmx-scraper, then extend coverage with custom YAML rules and validate output.]]></description>
            <content:encoded><![CDATA[<p>Java Management Extensions (JMX) is the JVM's built-in management interface, exposing runtime and component metrics such as memory, threads, and request pools. It is useful for collecting operational telemetry from Java services without changing application code.</p>
<p>Collecting JMX metrics with OpenTelemetry can be done in two main ways depending on your environment, requirements and constraints:</p>
<ul>
<li>from inside the JVM with the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation">OpenTelemetry Instrumentation Java</a> agent (or <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/java">EDOT Java</a>)</li>
<li>from outside the JVM with the <a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/jmx-scraper">jmx-scraper</a>.</li>
</ul>
<p>Throughout this article, we use the term &quot;Java agent&quot; to refer to the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation">OpenTelemetry Java instrumentation</a> agent. Everything here
also applies to Elastic's own distribution (<a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/java">EDOT Java</a>), which is based on it and provides the same features.</p>
<p>This walkthrough uses a <a href="https://tomcat.apache.org/">Tomcat</a> server as the target and shows how to validate which metrics are emitted with the logging exporter.</p>
<p>The configuration examples in this article use Java system properties passed as <code>-D</code> flags in the JVM startup command; equivalent environment variables can be used instead.</p>
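<p>For example, following the standard OpenTelemetry SDK naming convention, each <code>-D</code> system property maps to an environment variable by uppercasing the name and replacing dots with underscores:</p>
<pre><code class="language-bash"># System property form (JVM startup flag):
#   -Dotel.service.name=tomcat-demo
# Equivalent environment variable form:
export OTEL_SERVICE_NAME=tomcat-demo
export OTEL_METRICS_EXPORTER=otlp,logging
</code></pre>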
<h2>Prerequisites</h2>
<ul>
<li>A local <a href="https://tomcat.apache.org/">Tomcat</a> install (or any JVM app you can start with custom JVM flags)</li>
<li>Java 8+ on the host (the Tomcat version you use may require a more recent Java version)</li>
<li>An OpenTelemetry Collector endpoint if you want to ship metrics beyond local logging</li>
</ul>
<h2>Choosing between the Java agent and jmx-scraper</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/collecting-jmx-metrics-opentelemetry/collection_options.png" alt="Java agent vs jmx-scraper" /></p>
<p>Use the Java agent (or EDOT Java) when you can modify JVM startup flags and want in-process collection with full context from the running application: a single tool deployment captures traces, logs, and metrics.</p>
<p>Use jmx-scraper when you cannot install an agent on the JVM or prefer out-of-process collection from a separate host. This requires configuring the JVM and the network for remote JMX access, as well as managing authentication and credentials.</p>
<p>Both approaches rely on the same JMX metric mappings: you can validate the output with the logging exporter, then send metrics over OTLP to a Collector or any other OTLP endpoint.</p>
<h2>Option 1: Collect JMX metrics inside the JVM with the Java agent</h2>
<p>OpenTelemetry Java instrumentation ships with a curated set of JMX metric mappings. For Tomcat, you just need to enable the Java agent and set <code>otel.jmx.target.system=tomcat</code>.</p>
<h3>Step 1 - Download the OpenTelemetry Java agent</h3>
<p>Here the agent is downloaded to <code>/opt/otel</code>, but you can choose any location on the host.
Make sure the path is consistent with the <code>-javaagent</code> flag in the next step.</p>
<pre><code class="language-bash">mkdir -p /opt/otel
curl -L -o /opt/otel/opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar
</code></pre>
<h3>Step 2 - Configure Tomcat with <code>bin/setenv.sh</code></h3>
<p>Create or update <code>bin/setenv.sh</code> so Tomcat launches with the agent and JMX target system enabled.</p>
<pre><code class="language-bash">#!/bin/bash
export CATALINA_OPTS=&quot;$CATALINA_OPTS \
  -javaagent:/opt/otel/opentelemetry-javaagent.jar \
  -Dotel.service.name=tomcat-demo \
  -Dotel.metrics.exporter=otlp,logging \
  -Dotel.jmx.target.system=tomcat&quot;
</code></pre>
<p>This will configure the agent to log metrics (using the <code>logging</code> exporter) in addition to sending them to the Collector.</p>
<h3>Step 3 - Validate the emitted metrics</h3>
<p>Start Tomcat and watch stdout.</p>
<pre><code class="language-bash">./bin/catalina.sh run
</code></pre>
<p>By default, metrics are collected and exported every minute, so you might have to wait a bit for the first metrics to be logged.
If needed, you can use the <code>otel.metric.export.interval</code> configuration option to increase or reduce the export frequency.</p>
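<p>For example, to export metrics every 10 seconds instead of the default 60 (the value is in milliseconds; 10 seconds is just an illustrative choice for faster feedback during validation):</p>
<pre><code class="language-bash">export CATALINA_OPTS="$CATALINA_OPTS -Dotel.metric.export.interval=10000"
</code></pre>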
<p>You should see logging exporter output with JVM and Tomcat metrics. Look for lines containing the <code>LoggingMetricExporter</code> class name.</p>
<pre><code class="language-text">INFO io.opentelemetry.exporter.logging.LoggingMetricExporter - MetricData{name=tomcat.threadpool.currentThreadsBusy, ...}
INFO io.opentelemetry.exporter.logging.LoggingMetricExporter - MetricData{name=jvm.memory.used, ...}
</code></pre>
<h3>Step 4 - Send metrics to a Collector</h3>
<p>Once metric capture is validated, you should be ready to send metrics to a collector.</p>
<p>You will have to:</p>
<ul>
<li>remove the <code>logging</code> exporter as it's no longer necessary for production</li>
<li>configure the OTLP endpoint (<code>otel.exporter.otlp.endpoint</code>) and headers (<code>otel.exporter.otlp.headers</code>) if needed</li>
</ul>
<p>The <code>bin/setenv.sh</code> file should be modified to look like this:</p>
<pre><code class="language-bash">#!/bin/bash
export CATALINA_OPTS=&quot;$CATALINA_OPTS \
  -javaagent:/opt/otel/opentelemetry-javaagent.jar \
  -Dotel.service.name=tomcat-demo \
  -Dotel.jmx.target.system=tomcat \
  -Dotel.exporter.otlp.endpoint=https://your-collector:4317 \
  -Dotel.exporter.otlp.headers=Authorization=Bearer &lt;your-token&gt;&quot;
</code></pre>
<p>When using the Java agent, JVM metrics are automatically captured by the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-telemetry"><code>runtime-telemetry</code></a> module, so it is not necessary to include <code>jvm</code> in the <code>otel.jmx.target.system</code> configuration option.</p>
<h2>Option 2: Collect JMX metrics from outside the JVM with jmx-scraper</h2>
<p>When you cannot install an agent in the JVM or if only metrics are required, jmx-scraper lets you query JMX remotely and export metrics to an OTLP endpoint.</p>
<h3>Step 1 - Enable remote JMX on Tomcat</h3>
<p>Add JMX remote options to <code>bin/setenv.sh</code> and create access/password files.</p>
<blockquote>
<p><strong>Warning:</strong> This uses trivial credentials and disables SSL. Do not use this configuration in production.</p>
</blockquote>
<pre><code class="language-bash">mkdir -p /opt/jmx
cat &lt;&lt;EOF &gt; ${CATALINA_HOME}/jmxremote.access
monitorRole readonly
EOF

cat &lt;&lt;EOF &gt; ${CATALINA_HOME}/jmxremote.password
monitorRole monitorPass
EOF

chmod 600 ${CATALINA_HOME}/jmxremote.password

export CATALINA_OPTS=&quot;$CATALINA_OPTS \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9010 \
  -Dcom.sun.management.jmxremote.rmi.port=9010 \
  -Dcom.sun.management.jmxremote.authenticate=true \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.access.file=${CATALINA_HOME}/jmxremote.access \
  -Dcom.sun.management.jmxremote.password.file=${CATALINA_HOME}/jmxremote.password \
  -Djava.rmi.server.hostname=127.0.0.1&quot;
</code></pre>
<h3>Step 2 - Download jmx-scraper</h3>
<p>Here the jmx-scraper is downloaded to <code>/opt/otel</code>, but you can choose any location on the host.</p>
<pre><code class="language-bash">mkdir -p /opt/otel
curl -L -o /opt/otel/opentelemetry-jmx-scraper.jar \
  https://github.com/open-telemetry/opentelemetry-java-contrib/releases/latest/download/opentelemetry-jmx-scraper.jar
</code></pre>
<h3>Step 3 - Check the JMX connection</h3>
<p>Run jmx-scraper with the credentials from the previous step to confirm it can reach Tomcat. If the credentials are wrong, you will see authentication errors.</p>
<pre><code class="language-bash">java -jar /opt/otel/opentelemetry-jmx-scraper.jar \
  -Dotel.jmx.service.url=service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi \
  -Dotel.jmx.username=monitorRole \
  -Dotel.jmx.password=monitorPass \
  -Dotel.jmx.target.system=tomcat \
  -test
</code></pre>
<p>You should see one of the following on standard output:</p>
<ul>
<li><code>JMX connection test OK</code> if the connection and authentication are successful</li>
<li><code>JMX connection test ERROR</code> otherwise</li>
</ul>
<h3>Step 4 - Validate the emitted metrics</h3>
<p>Using the logging exporter lets you inspect metrics and attributes before sending them to a collector.</p>
<p>To capture both Tomcat and JVM metrics, set <code>otel.jmx.target.system</code> to <code>tomcat,jvm</code>.</p>
<pre><code class="language-bash">java -jar /opt/otel/opentelemetry-jmx-scraper.jar \
  -Dotel.jmx.service.url=service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi \
  -Dotel.jmx.username=monitorRole \
  -Dotel.jmx.password=monitorPass \
  -Dotel.jmx.target.system=tomcat,jvm \
  -Dotel.metrics.exporter=logging
</code></pre>
<h3>Step 5 - Send metrics to a Collector</h3>
<p>After validation, to send metrics to an OTLP endpoint, you will have to:</p>
<ul>
<li>remove the <code>-Dotel.metrics.exporter</code> option to restore the default <code>otlp</code> value</li>
<li>configure the OTLP endpoint (<code>otel.exporter.otlp.endpoint</code>) and headers (<code>otel.exporter.otlp.headers</code>) if needed</li>
</ul>
<pre><code class="language-bash">java -jar /opt/otel/opentelemetry-jmx-scraper.jar \
  -Dotel.jmx.service.url=service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi \
  -Dotel.jmx.username=monitorRole \
  -Dotel.jmx.password=monitorPass \
  -Dotel.jmx.target.system=tomcat,jvm \
  -Dotel.exporter.otlp.endpoint=https://your-collector:4317 \
  -Dotel.exporter.otlp.headers=&quot;Authorization=Bearer &lt;your-token&gt;&quot;
</code></pre>
<h2>Customizing the JMX Metrics Collection</h2>
<p>Once the built-in Tomcat and JVM mappings are flowing, you can add custom rules with <code>otel.jmx.config</code>. Create a YAML file and pass its path alongside <code>otel.jmx.target.system</code>.</p>
<p>For example, the following <code>custom.yaml</code> file captures a <code>custom.jvm.thread.count</code> metric from the <code>java.lang:type=Threading</code> MBean:</p>
<pre><code class="language-yaml">---
rules:
  - bean: &quot;java.lang:type=Threading&quot;
    mapping:
      ThreadCount:
        metric: custom.jvm.thread.count
        type: gauge
        unit: &quot;{thread}&quot;
        desc: Current number of live threads.
</code></pre>
<p>For a complete reference on the configuration format and syntax, see the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/jmx-metrics">jmx-metrics</a> module in OpenTelemetry Java instrumentation.</p>
<p>This custom configuration can be used with both jmx-scraper and the Java agent, as both support the <code>otel.jmx.config</code> configuration option. For example, with jmx-scraper:</p>
<pre><code class="language-bash">java -jar /opt/otel/opentelemetry-jmx-scraper.jar \
  -Dotel.jmx.service.url=service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi \
  -Dotel.jmx.username=monitorRole \
  -Dotel.jmx.password=monitorPass \
  -Dotel.jmx.target.system=tomcat,jvm \
  -Dotel.jmx.config=/opt/otel/jmx/custom.yaml
</code></pre>
<p>You can pass multiple custom files as a comma-separated list to <code>otel.jmx.config</code> when you need to organize metrics by team or component.</p>
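<p>For example, with jmx-scraper (the rule file names here are hypothetical; any readable YAML rule files work):</p>
<pre><code class="language-bash">java -jar /opt/otel/opentelemetry-jmx-scraper.jar \
  -Dotel.jmx.service.url=service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi \
  -Dotel.jmx.username=monitorRole \
  -Dotel.jmx.password=monitorPass \
  -Dotel.jmx.config=/opt/otel/jmx/tomcat-team.yaml,/opt/otel/jmx/jvm-team.yaml
</code></pre>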
<h2>Using the JMX Metrics in Kibana</h2>
<p>Once you have collected the JMX metrics using one of the approaches described in this article, you can start using them in Kibana.
You can build custom dashboards and visualizations to explore and analyze the metrics, create custom alerts on top of them or build MCP tools and AI Agents to use them in your agentic workflows.</p>
<p>Here is an example of how you can use the JMX metrics in Kibana through ES|QL:</p>
<pre><code class="language-esql">TS metrics*
| WHERE telemetry.sdk.language == &quot;java&quot;
| WHERE service.name == ?instance
| STATS
    request_rate = SUM(RATE(tomcat.request.count))
  BY Time = BUCKET(@timestamp, 100, ?_tstart, ?_tend)
</code></pre>
<p>You can use the native metric and dimension names of the JMX metrics to build your queries.
With the <code>TS</code> command you get first-class support for time series aggregation functions and dimensions on your metrics.
Queries like these are the building blocks for your dashboards, alerts, workflows, and AI agent tools.</p>
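<p>As another sketch, assuming the thread pool gauge shown earlier is ingested under its native name and that your stack version supports the <code>MAX_OVER_TIME</code> time series function, you could chart busy Tomcat threads over time:</p>
<pre><code class="language-esql">TS metrics*
| WHERE service.name == "tomcat-demo"
| STATS busy_threads = MAX(MAX_OVER_TIME(tomcat.threadpool.currentThreadsBusy))
  BY Time = BUCKET(@timestamp, 1 minute)
</code></pre>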
<p>Here is an example of a dashboard that visualizes the typical JMX metrics for Apache Tomcat:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/collecting-jmx-metrics-opentelemetry/tomcat_jmx_dashboard.png" alt="Tomcat Dashboard" /></p>
<h2>Conclusion</h2>
<p>In this article, we have seen how to collect JMX metrics with OpenTelemetry using the Java agent or jmx-scraper.
We have also seen how to use the JMX metrics in Kibana through ES|QL to build custom dashboards, alerts, workflows and AI agent tools.</p>
<p>This is just the beginning of what you can do with the JMX metrics and Elastic Observability.
Try it out yourself and explore the full potential of your JMX metrics when combined with powerful features provided by the Elastic Observability platform.</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/collecting-jmx-metrics-opentelemetry/jmx_header_image.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Introducing Elastic Distribution of OpenTelemetry Collector]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector</link>
            <guid isPermaLink="false">elastic-distribution-opentelemetry-collector</guid>
            <pubDate>Fri, 09 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[We are thrilled to announce the technical preview of the Elastic Distribution of OpenTelemetry Collector. This new offering underscores Elastic's dedication to this important framework and highlights our ongoing contributions to make OpenTelemetry the best vendor-agnostic data collection framework.]]></description>
            <content:encoded><![CDATA[<p>OpenTelemetry is an open-source framework that ensures vendor-agnostic data collection, providing a standardized approach for the collection, processing, and ingestion of observability data. Elastic is fully committed to this principle, aiming to make observability truly vendor-agnostic and eliminating the need for users to reinstrument their observability when switching platforms.</p>
<p>Over the past year, Elastic has made several notable contributions to the OpenTelemetry ecosystem. We <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">donated our Elastic Common Schema (ECS)</a> to OpenTelemetry, successfully <a href="https://opentelemetry.io/blog/2024/elastic-contributes-continuous-profiling-agent/">integrated the eBPF-based profiling agent</a>, and have consistently been one of the top contributing companies across the OpenTelemetry project. Additionally, Elastic has significantly improved upstream logging capabilities within OpenTelemetry with enhancements to key areas such as <a href="https://opentelemetry.io/blog/2024/otel-collector-container-log-parser/">container logging</a>, further enhancing the framework’s robustness.</p>
<p>These efforts demonstrate our strategic focus on enhancing and expanding the capabilities of OpenTelemetry for the broader observability community and reinforce the vendor-agnostic benefits of using OpenTelemetry.</p>
<p>Today, we are thrilled to announce the technical preview of the Elastic Distribution of OpenTelemetry Collector. This new offering underscores Elastic’s dedication to this important framework and highlights our ongoing contributions to make OpenTelemetry the best vendor agnostic data collection framework.</p>
<h2>Elastic Agent as an OpenTelemetry Collector<a id="elastic-agent-as-an-opentelemetry-collector"></a></h2>
<p>Technically, the Elastic Distribution of OpenTelemetry Collector represents an evolution of the Elastic Agent. In its latest version, the Elastic Agent can operate in an OpenTelemetry mode. This mode invokes a module within the Elastic Agent which is essentially a distribution of the OpenTelemetry collector. It is crafted using a selection of upstream components from the contrib distribution.</p>
<p>The Elastic OpenTelemetry Collector also includes configuration for this set of <a href="https://github.com/elastic/elastic-agent/tree/main/internal/pkg/otel#components">upstream OpenTelemetry Collector components</a>, providing out-of-the-box functionality with Elastic Observability. This integration allows users to seamlessly utilize Elastic’s advanced observability features with minimal setup.</p>
<p>The technical preview version of the Elastic OpenTelemetry Collector has been tailored with out-of-the-box configurations for the use cases below; we will keep working to add more as we progress:</p>
<ul>
<li>
<p><strong><em>Collect and ship logs</em></strong>: Use the Elastic OpenTelemetry Collector to gather log data from various sources and ship it directly to Elastic where it can be analyzed in Kibana Discover, and Elastic Observability’s Explorer (also in Tech Preview in 8.15).</p>
</li>
<li>
<p><strong><em>Assess host health</em></strong>: Leverage the OpenTelemetry host metrics and Kubernetes receivers to evaluate the performance of hosts and pods. This data can then be visualized and analyzed in Elastic’s Infrastructure Observability UIs, providing deep insights into host performance and health. Details of how this is configured in the OTel collector are outlined in this <a href="https://www.elastic.co/observability-labs/blog/infrastructure-monitoring-with-opentelemetry-in-elastic-observability">blog</a>.</p>
</li>
<li>
<p><strong>Kubernetes container logs</strong>: Additionally, users of the Elastic OpenTelemetry Collector benefit from out-of-the-box Kubernetes container and application logs enriched with Kubernetes metadata by leveraging the powerful <a href="https://opentelemetry.io/blog/2024/otel-collector-container-log-parser/">container log parser</a> Elastic recently contributed to OTel. This OpenTelemetry-based enrichment enhances the context and value of the collected logs, providing deeper insights and more effective troubleshooting capabilities.</p>
</li>
</ul>
<p>While the Elastic OpenTelemetry Collector comes pre-built and preconfigured for the sake of easier onboarding and getting started experience, Elastic is committed to the vision of vendor-neutral collection of data. Thus, we strive to contribute any Elastic specific features back to the upstream OpenTelemetry components, to advance and help grow the OpenTelemetry landscape and capabilities.</p>
<p>Stay tuned for upcoming announcements sharing our plans to combine the best of Elastic Agent and OpenTelemetry Collector.</p>
<h2>Get started with the Elastic Distribution of OpenTelemetry Collector<a id="get-started-the-elastic-distribution-for-opentelemetry-collector"></a></h2>
<p>To get started with a guided onboarding flow for the Elastic Distribution of the OpenTelemetry Collector for Kubernetes, Linux, and Mac environments, visit the <a href="https://github.com/elastic/opentelemetry/blob/main/docs/guided-onboarding.md">guided onboarding documentation</a>.</p>
<p>For more advanced manual configuration, follow the <a href="https://github.com/elastic/opentelemetry/blob/main/docs/manual-configuration.md">manual configuration instructions</a>.</p>
<p>Once the Elastic Distribution of the OpenTelemetry Collector is set up and running, you’ll be able to analyze your systems within various features of the Elastic Observability solution.</p>
<p>Analyze the performance and health of your infrastructure, through corresponding metrics and logs collected through OpenTelemetry Collector receivers, such as the host metrics receiver and different Kubernetes receivers.</p>
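<p>As an illustrative sketch (component names come from the upstream contrib distribution; the endpoint is a placeholder), a minimal Collector pipeline wiring the host metrics receiver to an OTLP exporter could look like this:</p>
<pre><code class="language-yaml">receivers:
  hostmetrics:
    collection_interval: 60s
    scrapers:
      cpu:
      memory:
      filesystem:

exporters:
  otlp:
    endpoint: your-collector:4317

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [otlp]
</code></pre>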
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-collector/hosts.png" alt="OTel Monitoring Hosts" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-collector/otel-daemonset-green-logs.png" alt="OTel Logs" /></p>
<p>With the Elastic OpenTelemetry Collector, container and application logs are enriched with Kubernetes metadata out of the box, making filtering, grouping, and log analysis easier and more efficient.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-collector/explorer.png" alt="OTel Discover" /></p>
<p>The Elastic Distribution of the OpenTelemetry Collector allows for tracing just like any other collector distribution made of upstream components. Explore and analyze the performance and runtime behavior of your applications and services through RED metrics, service maps, and distributed traces collected from OpenTelemetry SDKs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-collector/apm.png" alt="OTel APM" /></p>
<p>The above capabilities and features packed into the Elastic OpenTelemetry Collector can be achieved in a similar way with a custom build of the upstream OpenTelemetry Collector packaging the right set of upstream components. To do just that, follow our <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-collector/customization">guidance here</a>.</p>
<h2>Outlook<a id="outlook"></a></h2>
<p>The launch of the technical preview of the Elastic Distribution of OpenTelemetry Collector is another step on Elastic’s journey towards OpenTelemetry based observability. On that journey we are committed to a vendor-agnostic approach to data collection and therefore prioritize upstream contribution to OpenTelemetry over Elastic-specific data collection features.</p>
<p>Stay tuned to see more of Elastic’s contributions to OpenTelemetry and observe Elastic’s journey towards fully OpenTelemetry-based observability.</p>
<p>Additional resources for OpenTelemetry with Elastic:</p>
<ul>
<li>
<p>Elastic Distributions recently introduced:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution of OpenTelemetry's Java SDK</a>.</p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python">Elastic Distribution of OpenTelemetry's Python SDK</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js">Elastic Distribution of OpenTelemetry's NodeJS SDK</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-dotnet-applications">Elastic Distribution of OpenTelemetry's .NET SDK</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/apm-ios-android-native-apps">Elastic Distribution of OpenTelemetry for iOS and Android</a></p>
</li>
</ul>
</li>
<li>
<p>Other Elastic OpenTelemetry resources:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></p>
</li>
</ul>
</li>
<li>
<p>Instrumentation resources:</p>
<ul>
<li>
<p>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual instrumentation</a></p>
</li>
<li>
<p>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual instrumentation </a></p>
</li>
<li>
<p>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual instrumentation</a></p>
</li>
<li>
<p>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual instrumentation</a></p>
</li>
</ul>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-collector/otel-collector-announcement.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Announcing GA of Elastic distribution of the OpenTelemetry Java Agent]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent</link>
            <guid isPermaLink="false">elastic-distribution-opentelemetry-java-agent</guid>
            <pubDate>Thu, 12 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic announces general availability of the Elastic distribution of the OpenTelemetry (OTel) Java Agent, a fully OTel-compatible agent with a rich set of useful additional features.]]></description>
            <content:encoded><![CDATA[<p>As Elastic continues its commitment to OpenTelemetry (OTel), we are excited to announce general availability of the <a href="https://github.com/elastic/elastic-otel-java">Elastic Distribution of OpenTelemetry Java (EDOT Java)</a>. EDOT Java is a fully compatible drop-in replacement for the OTel Java agent that comes with a set of built-in, useful extensions for powerful additional features and improved usability with Elastic Observability. Use EDOT Java to start the OpenTelemetry SDK with your Java application, and automatically capture tracing data, performance metrics, and logs. Traces, metrics, and logs can be sent to any OpenTelemetry Protocol (OTLP) collector you choose.</p>
<p>With EDOT Java you have access to all the features of the OpenTelemetry Java agent plus:</p>
<ul>
<li>Access to SDK improvements and bug fixes contributed by the Elastic team before the changes are available upstream in OpenTelemetry repositories.</li>
<li>Access to optional features that can enhance OpenTelemetry data that is being sent to Elastic (for example, inferred spans and span stacktrace).</li>
</ul>
<p>In this blog post, we will explore the rationale behind our unique distribution, detailing the powerful additional features it brings to the table. We will provide an overview of how these enhancements can be utilized with our distribution, the standard OTel SDK, or the vanilla OTel Java agent. Stay tuned as we conclude with a look ahead at our future plans and what you can expect from Elastic contributions to OTel Java moving forward.</p>
<h2>Elastic Distribution of OpenTelemetry Java (EDOT Java)</h2>
<p>Until now, Elastic users looking to monitor their Java services through automatic instrumentation had two options: the proprietary Elastic APM Java agent or the vanilla OTel Java agent. While both agents offer robust capabilities and have reached a high level of maturity, each has its distinct advantages and limitations. The OTel Java agent provides extensive instrumentation across a broad spectrum of frameworks and libraries, is highly extensible, and natively emits OTel data. Conversely, the Elastic APM Java agent includes several powerful features absent in the OTel Java agent.</p>
<p>Elastic’s distribution of the OTel Java agent aims to bring together the best aspects of the proprietary Elastic Java agent and the OpenTelemetry Java agent. This distribution enhances the vanilla OTel Java agent with a set of additional features realized through extensions, while still being a fully compatible drop-in replacement.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-java-agent/1.png" alt="Elastic distribution of the OpenTelemetry Java agent" /></p>
<p>Elastic’s commitment to OpenTelemetry not only focuses on standardizing data collection around OTel but also includes improving OTel components and integrating Elastic's data collection features into OTel. In this vein, our ultimate goal is to contribute as many features from Elastic’s distribution back to the upstream OTel Java agent; our distribution is designed in such a way that the additional features, realized as extensions, work directly with the OTel SDK. This means they can be used independent of Elastic’s distro — either with the Otel Java SDK or with the vanilla OTel Java agent. We’ll discuss these usage patterns further in the sections below.</p>
<h2>Features included</h2>
<p>The Elastic distribution of the OpenTelemetry Java agent includes a suite of extensions that deliver the features outlined below.</p>
<h3>Inferred spans</h3>
<p>In a <a href="https://www.elastic.co/observability-labs/blog/tracing-data-inferred-spans-opentelemetry">recent blog post</a>, we introduced inferred spans, a powerful feature designed to enhance distributed traces with additional profiling-based spans.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-java-agent/2.png" alt="Inferred spans" /></p>
<p>Inferred spans (blue spans labeled “internal” in the above image) offer valuable insights into sources of latency within the code that might remain uncaptured by purely instrumentation-based traces. In other words, they fill in the gaps between instrumentation-based traces. The Elastic distribution of the OTel Java agent includes the inferred spans feature. It can be enabled by setting the following environment variable.</p>
<pre><code class="language-bash">ELASTIC_OTEL_INFERRED_SPANS_ENABLED=true
</code></pre>
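For example, the feature can be switched on for a single launch by setting the variable inline (a sketch; the agent jar path is a placeholder to adapt to your environment):
<pre><code class="language-bash"># Sketch: enable inferred spans for one run of the EDOT Java agent
# (the /pathto/ location is a placeholder)
ELASTIC_OTEL_INFERRED_SPANS_ENABLED=true \
java -javaagent:/pathto/elastic-otel-javaagent.jar -jar myapp.jar
</code></pre>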
<h3>Correlation with profiling</h3>
<p>With <a href="https://opentelemetry.io/blog/2024/profiling/">OpenTelemetry embracing profiling</a> and <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">Elastic's proposal to donate its eBPF-based, continuous profiling agent</a>, a new frontier opens up in correlating distributed traces with continuous profiling data. This integration offers unprecedented code-level insights into latency issues and CO2 emission footprints, all within a clearly defined service, transaction, and trace context. To get started, follow <a href="https://www.elastic.co/observability-labs/blog/universal-profiling-with-java-apm-services-traces">this guide</a> to set up Universal Profiling and the OpenTelemetry integration. For more background on the feature, check out <a href="https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation">this blog article</a>, where we explore how these technologies converge to enhance observability and environmental consciousness in software development.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-java-agent/3.png" alt="Correlation with profiling" /></p>
<p>Users of Elastic Universal Profiling can already leverage the Elastic distribution of the OTel Java agent to access this powerful integration. With Elastic's proposed donation of the profiling agent, we anticipate that this capability will soon be available to all OTel users who employ the OTel Java agent in conjunction with the new OTel eBPF profiling.</p>
<h3>Span stack traces</h3>
<p>In many cases, spans within a distributed trace are relatively coarse-grained, particularly when features like inferred spans are not used. Understanding precisely where in the code path a span originates can be incredibly valuable. To address this need, the Elastic distribution of the OTel Java agent includes the span stack traces feature. This functionality provides crucial insights by collecting corresponding stack traces for spans that exceed a configurable minimum duration, pinpointing exactly where a span is initiated in the code.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-java-agent/4.png" alt="Span stack traces" /></p>
<p>This simple yet powerful feature significantly enhances problem troubleshooting, offering developers a clearer understanding of their application’s performance dynamics.</p>
<p>In the example above, this allows you to get the call stack of a gRPC call, which can help you understand which code paths triggered it.</p>
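As a sketch of how the duration threshold might be tuned, the minimum span duration can be set via an environment variable alongside the agent. The option name below is an assumption; verify it against the EDOT Java configuration reference:
<pre><code class="language-bash"># Assumed option name -- verify against the EDOT Java docs.
# Only collect stack traces for spans longer than 10ms:
ELASTIC_OTEL_SPAN_STACK_TRACE_MIN_DURATION=10ms \
java -javaagent:/pathto/elastic-otel-javaagent.jar -jar myapp.jar
</code></pre>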
<h3>Auto-detection of service and cloud resources</h3>
<p>In today's expansive and diverse cloud environments, which often include multiple regions and cloud providers, having information on where your services are operating is incredibly valuable. Particularly in Java services, where the service name is frequently embedded within the deployment artifacts, the ability to automatically retrieve service and cloud resource information marks a substantial leap in usability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-java-agent/5.png" alt="Auto-detection of service and cloud resources" /></p>
<p>To address this need, the Elastic distribution of the OTel Java agent includes built-in auto-detectors for service and cloud resources, specifically for AWS and GCP, sourced from <a href="https://github.com/open-telemetry/opentelemetry-java-contrib">the OpenTelemetry Java Contrib repository</a>. This feature, which is enabled by default, enhances observability and streamlines the management of services across various cloud platforms, making it a key asset for any cloud-based deployment.</p>
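Because these detectors plug into the standard SDK autoconfiguration, individual providers can be switched off via the stock disabled-resource-providers option if a lookup is undesired in your environment. A sketch, with an illustrative provider class name (verify the exact class against the respective contrib module):
<pre><code class="language-bash"># Illustrative: disable a single resource provider by fully-qualified class name
OTEL_JAVA_DISABLED_RESOURCE_PROVIDERS=io.opentelemetry.contrib.gcp.resource.GCPResourceProvider \
java -javaagent:/pathto/elastic-otel-javaagent.jar -jar myapp.jar
</code></pre>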
<h2>Ways to use EDOT Java</h2>
<p>The Elastic distribution of the OTel Java agent is designed to meet our users exactly where they are, accommodating a variety of needs and strategic approaches. Whether you're looking to fully integrate new observability features or simply enhance existing setups, the Elastic distribution offers multiple technical pathways to leverage its capabilities. This flexibility ensures that users can tailor the agent's implementation to align perfectly with their specific operational requirements and goals.</p>
<h3>Using Elastic’s distribution directly</h3>
<p>The most straightforward path to harnessing the capabilities described above is by adopting the Elastic distribution of the OTel Java agent as a drop-in replacement for the standard OTel Java agent. Structurally, the Elastic distro functions as a wrapper around the OTel Java agent, maintaining full compatibility with all upstream configuration options and incorporating all its features. Additionally, it includes the advanced features described above that significantly augment its functionality. Users of the Elastic distribution will also benefit from the comprehensive technical support provided by Elastic, which will commence once the agent achieves general availability. To get started, simply <a href="https://mvnrepository.com/artifact/co.elastic.otel/elastic-otel-javaagent">download the agent Jar file</a> and attach it to your application:</p>
<pre><code class="language-bash">java -javaagent:/pathto/elastic-otel-javaagent.jar -jar myapp.jar
</code></pre>
<h3>Using Elastic’s extensions with the vanilla OTel Java agent</h3>
<p>If you prefer to continue using the vanilla OTel Java agent but wish to take advantage of the features described above, you have the flexibility to do so. We offer a separate agent extensions package specifically designed for this purpose. To integrate these enhancements, simply <a href="https://mvnrepository.com/artifact/co.elastic.otel/elastic-otel-agentextension">download the extensions jar file</a>, place it in a location of your choice, and point the OTel Java agent at it via its extensions option:</p>
<pre><code class="language-bash">OTEL_JAVAAGENT_EXTENSIONS=/pathto/elastic-otel-agentextension.jar
java -javaagent:/pathto/otel-javaagent.jar -jar myapp.jar
</code></pre>
<h3>Using Elastic’s extensions manually with the OTel Java SDK</h3>
<p>If you build your instrumentations directly into your applications using the OTel API and rely on the OTel Java SDK instead of the automatic Java agent, you can still use the features we've discussed. Each feature is designed as a standalone component that can be integrated with the OTel Java SDK framework. To implement these features, simply refer to the specific descriptions for each one to learn how to configure the OTel Java SDK accordingly:</p>
<ul>
<li><a href="https://github.com/elastic/elastic-otel-java/tree/main/inferred-spans">Setting up the inferred spans feature with the SDK</a></li>
<li><a href="https://github.com/elastic/elastic-otel-java/tree/main/universal-profiling-integration">Setting up profiling correlation with the SDK</a></li>
<li><a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/span-stacktrace">Setting up the span stack traces feature with the SDK</a></li>
<li>Setting up resource detectors with the SDK
<ul>
<li><a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/resource-providers">Service resource detectors</a></li>
<li><a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/aws-resources">AWS resource detector</a></li>
<li><a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/gcp-resources">GCP resource detector</a></li>
</ul>
</li>
</ul>
<p>This approach ensures that you can tailor your observability tools to meet your specific needs without compromising on functionality.</p>
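As an orientation, wiring such components into a manually configured SDK generally follows the pattern below. This is an illustrative sketch, not the exact API of each extension (consult the linked READMEs for the concrete types); it assumes the OpenTelemetry SDK and OTLP exporter artifacts are on the classpath:
<pre><code class="language-java">import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class SdkSetup {
  public static OpenTelemetry init() {
    SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
        // Extensions such as the span stack trace processor are registered
        // here as additional span processors (see each module's README for
        // the exact type), next to the regular export pipeline:
        .addSpanProcessor(BatchSpanProcessor.builder(
            OtlpGrpcSpanExporter.builder().build()).build())
        .build();
    return OpenTelemetrySdk.builder()
        .setTracerProvider(tracerProvider)
        .buildAndRegisterGlobal();
  }
}
</code></pre>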
<h2>Future plans and contributions</h2>
<p>We are committed to OpenTelemetry, and our contributions to the OpenTelemetry Java project will continue unabated. Not only are we focused on general improvements within the OTel Java project, but we are also committed to ensuring that the features discussed in this blog post become official extensions to the OpenTelemetry Java SDK/Agent and are included in the OpenTelemetry Java Contrib repository. We have already contributed the <a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/span-stacktrace">span stack trace feature</a> and initiated the contribution of the inferred spans feature, and we are eagerly anticipating the opportunity to add the profiling correlation feature following the successful integration of Elastic’s profiling agent.</p>
<p>Moreover, our efforts extend beyond the current enhancements; we are actively working to port more features from the Elastic APM Java agent to OpenTelemetry. A particularly ambitious yet thrilling endeavor is our project to enable dynamic configurability of the OpenTelemetry Java agent. This future enhancement will allow for the OpenTelemetry Agent Management Protocol (OpAMP) to be used to remotely and dynamically configure OTel Java agents, improving their adaptability and ease of use.</p>
<p>We encourage you to experience the new Elastic distribution of the OTel Java agent and share your feedback with us. Your insights are invaluable as we strive to enhance the capabilities and reach of OpenTelemetry, making it even more powerful and user-friendly.</p>
<p>Check out more information on Elastic Distributions of OpenTelemetry on <a href="https://github.com/elastic/opentelemetry?tab=readme-ov-file">GitHub</a> and in our latest <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">EDOT blog</a>.</p>
<p>Elastic provides the following components of EDOT:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">Elastic Distribution of OpenTelemetry (EDOT) Collector</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution of OpenTelemetry (EDOT) Java</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python">Elastic Distribution of OpenTelemetry (EDOT) Python</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js">Elastic Distribution of OpenTelemetry (EDOT) NodeJS</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-dotnet-applications">Elastic Distribution of OpenTelemetry (EDOT) .NET</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/apm-ios-android-native-apps">Elastic Distribution of OpenTelemetry (EDOT)  iOS and Android</a></p>
</li>
</ul>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-java-agent/observability-launch-series-3-java-auto.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Introducing Elastic Distributions of OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry</link>
            <guid isPermaLink="false">elastic-distributions-opentelemetry</guid>
            <pubDate>Thu, 15 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic is proud to introduce Elastic Distributions of OpenTelemetry (EDOT), which contains Elastic’s versions of the OpenTelemetry Collector and several language SDKs like Python, Java, .NET, and NodeJS. These help provide enhanced features and enterprise-grade support for EDOT.]]></description>
            <content:encoded><![CDATA[<p>We are announcing the availability of Elastic Distributions of OpenTelemetry (EDOT). These Elastic distributions, currently in tech preview,  have been developed to enhance the capabilities of standard OpenTelemetry distributions and improve existing OpenTelemetry support from Elastic. </p>
<p>The Elastic Distributions of OpenTelemetry (EDOT) are composed of OpenTelemetry (OTel) project components, OTel Collector, and language SDKs,  which provide users with the necessary capabilities and out-of-the-box configurations, enabling quick and effortless infra and application monitoring.</p>
<p>While OTel components are feature-rich, enhancements through the community can take time. Additionally, support is left up to the community or individual users and organizations. Hence EDOT will bring the following to end users:</p>
<ul>
<li>
<p><strong>Deliver enhanced features earlier than OTel</strong>: By providing features unavailable in the “vanilla” OpenTelemetry components, we can quickly meet customers’ requirements while still providing an OpenTelemetry native and vendor-agnostic instrumentation for their applications. Elastic will continuously upstream these enhanced features.</p>
</li>
<li>
<p><strong>Enhanced OTel support</strong> - By maintaining Elastic distributions, we can better support customers with enhancements and fixes outside of the OTel release cycles. In addition, Elastic support can troubleshoot issues on the EDOT.</p>
</li>
</ul>
<p>EDOT currently includes the following tech preview components, which will  grow over time:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">Elastic Distribution of OpenTelemetry (EDOT) Collector</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution of OpenTelemetry (EDOT) Java</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python">Elastic Distribution of OpenTelemetry (EDOT) Python</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js">Elastic Distribution of OpenTelemetry (EDOT) NodeJS</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-dotnet-applications">Elastic Distribution of OpenTelemetry (EDOT) .NET</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/apm-ios-android-native-apps">Elastic Distribution of OpenTelemetry (EDOT)  iOS and Android</a></p>
</li>
</ul>
<p>Details and documentation for all EDOT components are available in our public <a href="https://github.com/elastic/opentelemetry">OpenTelemetry GitHub repository</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distributions-opentelemetry/edot-components.png" alt="EDOT Components" /></p>
<h2>Elastic Distribution of OpenTelemetry (EDOT) Collector<a id="elastic-distribution-of-opentelemetry-edot-collector"></a></h2>
<p>The EDOT Collector, released with Elastic Observability 8.15, enhances Elastic’s existing OTel capabilities. In addition to service monitoring, the EDOT Collector can forward application logs, infrastructure logs, and metrics using standard OpenTelemetry Collector receivers such as the filelog and host metrics receivers.</p>
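As an illustration of that setup, a Collector configuration using those receivers might look like the following minimal sketch. The include path, interval, and scraper selection are placeholders to adapt to your hosts:
<pre><code class="language-yaml"># Minimal sketch -- paths and intervals are placeholders
receivers:
  filelog:
    include: [ /var/log/myapp/*.log ]
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
</code></pre>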
<p>Additionally, users of the Elastic Distribution of the OpenTelemetry Collector benefit from container logs automatically enriched with Kubernetes metadata by leveraging the powerful <a href="https://opentelemetry.io/blog/2024/otel-collector-container-log-parser/">container log parser</a> that Elastic recently contributed. This OpenTelemetry-based enrichment enhances the context and value of the collected logs, providing deeper insights and more effective troubleshooting capabilities.</p>
<p>This new collector distribution ensures that exported data is fully compatible with the Elastic Platform, enhancing the overall observability experience. Elastic also ensures that Elastic-curated UIs can seamlessly handle both the Elastic Common Schema (ECS) and OpenTelemetry formats.</p>
<h2>Elastic Distributions for Language SDKs<a id="elastic-distributions-for-language-sdks"></a></h2>
<p><a href="https://www.elastic.co/guide/en/apm/agent/index.html">Elastic's APM agents</a> have capabilities not yet available in the OTel SDKs. EDOT brings these capabilities into the OTel language SDKs while maintaining seamless integration with Elastic Observability. Elastic will release OTel versions of all its APM agents and continue to add language SDKs mirroring OTel.</p>
<h2>Continued support for Native OTel components<a id="continued-support-for-native-otel-components"></a></h2>
<p>EDOT does not preclude users from using native components. Users are still able to use:</p>
<ul>
<li>
<p><strong>OpenTelemetry Vanilla Language SDKs:</strong> use standard OpenTelemetry code instrumentation for many popular programming languages, sending OTLP traces to Elastic via the APM server.</p>
</li>
<li>
<p><strong>Upstream Distribution of OpenTelemetry Collector (Contrib or Custom):</strong> send traces using the OpenTelemetry Collector with an OTLP receiver and an OTLP exporter to Elastic via the APM server.</p>
</li>
</ul>
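For instance, a vanilla Collector pipeline forwarding OTLP traces to Elastic could be sketched as follows. The endpoint and token are placeholders for your deployment:
<pre><code class="language-yaml"># Sketch -- endpoint and token are placeholders for your deployment
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlp/elastic:
    endpoint: "https://your-apm-server:8200"
    headers:
      Authorization: "Bearer your-secret-token"
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/elastic]
</code></pre>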
<p>Elastic is committed to contributing EDOT features or components upstream into the OpenTelemetry community, fostering a collaborative environment, and enhancing the overall OpenTelemetry ecosystem.</p>
<h2>Extending our commitment to vendor-agnostic data collection<a id="extending-our-commitment-to-vendor-agnostic-data-collection"></a></h2>
<p>Elastic remains committed to supporting OpenTelemetry by being OTel first and building a vendor-agnostic framework. As OpenTelemetry constantly grows its support of SDKs and components, Elastic will continue to refine EDOT to mirror OpenTelemetry and push enhancements upstream.</p>
<p>Over the past year, Elastic has been active in OTel through its <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">donation of Elastic Common Schema (ECS)</a>, contributions to the native <a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">OpenTelemetry Collector</a> and language SDKs, and a recent <a href="https://www.elastic.co/observability-labs/blog/elastic-profiling-agent-acceptance-opentelemetry">donation of its Universal Profiling agent</a> to OpenTelemetry. </p>
<p>EDOT  builds on our decision to fully adopt and recommend OpenTelemetry as the preferred solution for observing applications. With EDOT, Elastic customers can future-proof their investments and adopt OpenTelemetry, giving them vendor-neutral instrumentation with Elastic enterprise-grade support.</p>
<p>Our vision is that Elastic will work with the OpenTelemetry community to donate features through the standardization processes and contribute the code to implement those in the native OpenTelemetry components. In time, as OTel capabilities advance and many of the Elastic-exclusive features transition into OpenTelemetry, we look forward to no longer needing Elastic Distributions of OpenTelemetry. In the meantime, we can deliver those capabilities via our OpenTelemetry distributions.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-distributions-opentelemetry/edot-image.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[OpenTelemetry and Elastic: Working together to establish continuous profiling for the community]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry</link>
            <guid isPermaLink="false">elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry</guid>
            <pubDate>Tue, 12 Mar 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[OpenTelemetry is embracing profiling. Elastic is donating its whole-system continuous profiling agent to OpenTelemetry to further this advancement, empowering OTel users to improve computational efficiency and reduce their carbon footprint.]]></description>
            <content:encoded><![CDATA[<p>Profiling is emerging as a core pillar of observability, aptly dubbed the fourth pillar, with the OpenTelemetry (OTel) project leading this essential development. This blog post dives into the recent advancements in profiling within OTel and how Elastic® is actively contributing toward it.</p>
<p>At Elastic, we’re big believers in and contributors to the OpenTelemetry project. The project’s benefits of flexibility, performance, and vendor agnosticism have been making their rounds; we’ve seen a groundswell of customer interest.</p>
<p>To this end, after donating our <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-faq"><strong>Elastic Common Schema</strong></a> and our <a href="https://www.elastic.co/blog/elastic-invokedynamic-opentelemetry-java-agent">invokedynamic-based Java agent approach</a>, we recently <a href="https://github.com/open-telemetry/community/issues/1918">announced our intent to donate our continuous profiling agent</a> — a whole-system, always-on, continuous profiling solution that eliminates the need for run-time/bytecode instrumentation, recompilation, on-host debug symbols, or service restarts.</p>
<p>Profiling helps organizations run efficient services by minimizing computational wastage, thereby reducing operational costs. Leveraging <a href="https://ebpf.io/">eBPF</a>, the Elastic profiling agent provides unprecedented visibility into the runtime behavior of all applications: it builds stacktraces that go from the kernel, through userspace native code, all the way into code running in higher level runtimes, enabling you to identify performance regressions, reduce wasteful computations, and debug complex issues faster.</p>
<h2>Enabling profiling in OpenTelemetry: A step toward unified observability</h2>
<p>Elastic actively participates in the OTel community, particularly within the Profiling Special Interest Group (SIG). This group has been instrumental in defining the OTel <a href="https://github.com/open-telemetry/oteps/blob/main/text/profiles/0239-profiles-data-model.md">Profiling Data Model</a>, a crucial step toward standardizing profiling data.</p>
<p>The recent merger of the <a href="https://github.com/open-telemetry/oteps/pull/239">OpenTelemetry Enhancement Proposal (OTEP) introducing profiling support to the OpenTelemetry Protocol (OTLP)</a> marks a significant milestone. With the standardization of profiles as a core observability pillar alongside metrics, tracing, and logs, OTel offers a comprehensive suite of observability tools, empowering users to gain a holistic view of their applications' health and performance.</p>
<p>In line with this advancement, we are donating our whole-system, eBPF-based continuous profiling agent to OTel. In parallel, we are implementing the experimental OTel Profiling signal in the profiling agent, to ensure and demonstrate OTel protocol compatibility in the agent and prepare it for a fully OTel-based collection of profiling signals and correlate it to logs, metrics, and traces.</p>
<h2>Why is Elastic donating the eBPF-based profiling agent to OpenTelemetry?</h2>
<p>Computational efficiency has always been a critical concern for software professionals. However, in an era where every line of code affects both the bottom line and the environment, there's an additional reason to focus on it. Elastic is committed to helping the OpenTelemetry community enhance computational efficiency because efficient software not only reduces the cost of goods sold (COGS) but also reduces carbon footprint.</p>
<p>We have seen firsthand — both internally and from our customers' testimonials — how profiling insights aid in enhancing software efficiency. This results in an improved customer experience, lower resource consumption, and reduced cloud costs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry/1-flamegraph.png" alt="A differential flamegraph showing regression in release comparison" /></p>
<p>Moreover, adopting a whole-system profiling strategy, such as <a href="https://www.elastic.co/blog/whole-system-visibility-elastic-universal-profiling">Elastic Universal Profiling</a>, differs significantly from traditional instrumentation profilers that focus solely on runtime. Elastic Universal Profiling provides whole-system visibility, profiling not only your own code but also third-party libraries, kernel operations, and other code you don't own. This comprehensive approach facilitates rapid optimizations by identifying non-optimal common libraries and uncovering &quot;unknown unknowns&quot; that consume CPU cycles. Often, a tipping point is reached when the resource consumption of libraries or certain daemon processes exceeds that of the applications themselves. Without system-wide profiling, along with the capabilities to slice data per service and aggregate total usage, pinpointing these resource-intensive components becomes a formidable challenge.</p>
<p>At Elastic, we have a customer with an extensive cloud footprint who plans to negotiate with their cloud provider to reclaim money for the significant compute resource consumed by the cloud provider's in-VM agents. These examples highlight the importance of whole-system profiling and the benefits that the OpenTelemetry community will gain if the donation proposal is accepted.</p>
<p>Specifically, OTel users will gain access to a lightweight, battle-tested production-grade continuous profiling agent with the following features:</p>
<ul>
<li>
<p>Very low CPU and memory overhead (1% CPU and 250MB memory are our upper limits in testing, and the agent typically manages to stay way below that)</p>
</li>
<li>
<p>Support for native C/C++ executables without the need for DWARF debug information by leveraging .eh_frame data, as described in “<a href="https://www.elastic.co/blog/universal-profiling-frame-pointers-symbols-ebpf">How Universal Profiling unwinds stacks without frame pointers and symbols</a>”</p>
</li>
<li>
<p>Support profiling of system libraries without frame pointers and without debug symbols on the host</p>
</li>
<li>
<p>Support for mixed stacktraces between runtimes — stacktraces go from Kernel space through unmodified system libraries all the way into high-level languages</p>
</li>
<li>
<p>Support for native code (C/C++, Rust, Zig, Go, etc. without debug symbols on host)</p>
</li>
<li>
<p>Support for a broad set of high-level languages (HotSpot JVM, Python, Ruby, PHP, Node.js, V8, Perl); .NET support is in preparation</p>
</li>
<li>
<p><strong>100% non-intrusive:</strong> there's no need to load agents or libraries into the processes that are being profiled</p>
</li>
<li>
<p>No need for any reconfiguration, instrumentation, or restarts of HLL interpreters and VMs: the agent supports unwinding each of the supported languages in the default configuration</p>
</li>
<li>
<p>Support for x86 and Arm64 CPU architectures</p>
</li>
<li>
<p>Support for native inline frames, which provide insights into compiler optimizations and offer a higher precision of function call chains</p>
</li>
<li>
<p>Support for <a href="https://www.elastic.co/guide/en/observability/current/profiling-probabilistic-profiling.html">Probabilistic Profiling</a> to reduce data storage costs</p>
</li>
<li>
<p>. . . and more</p>
</li>
</ul>
<p>Elastic's commitment to enhancing computational efficiency and our belief in the OpenTelemetry vision underscore our dedication to advancing the observability ecosystem. By donating the profiling agent, Elastic is not only contributing technology but also dedicating a team of specialized profiling domain experts to co-maintain and advance the profiling capabilities within OpenTelemetry.</p>
<h2>How does this donation benefit the OTel community?</h2>
<p>Metrics, logs, and traces offer invaluable insights into system health. But what if you could unlock an even deeper level of visibility? Here's why profiling is a perfect complement to your OTel toolkit:</p>
<h3>1. Deep system visibility: Beyond the surface</h3>
<p>Think of whole-system profiling as an MRI scan for your fleet. It goes deeper into the internals of your system, revealing hidden performance issues lurking beneath the surface. You can identify &quot;unknown unknowns&quot; — inefficiencies you wouldn't have noticed otherwise — and gain a comprehensive understanding of how your system functions at its core.</p>
<h3>2. Cross-signal correlation: Answering &quot;why&quot; with confidence</h3>
<p>The Elastic Universal Profiling agent supports trace correlation with the OTel Java agent/SDK (with Go support coming soon!). This correlation enables OTel users to view profiling data by services or service endpoints, allowing for a more context-aware and targeted root cause analysis. This powerful combination allows you to pinpoint the exact cause of resource consumption at the trace level. No more guessing why specific functions hog CPU or why certain events occur. You can finally answer the critical &quot;why&quot; questions with precision, enabling targeted optimization efforts.</p>
<h3>3. Cost and sustainability optimization: Beyond performance</h3>
<p>Our approach to profiling goes beyond just performance gains. By correlating whole-system profiling data with tracing, we can help you measure the environmental impact and cloud cost associated with specific services and functionalities within your application. This empowers you to make data-driven decisions that optimize both performance and resource utilization, leading to a more sustainable and cost-effective operation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry/2-universal-profiling.png" alt="A differential function insight, showing the performance, cost, and CO2 impact of a change" /></p>
<h2>Elastic's commitment to OpenTelemetry</h2>
<p>Elastic currently supports a growing list of Cloud Native Computing Foundation (CNCF) projects <a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">such as Kubernetes (K8S), Prometheus, Fluentd, Fluent Bit, and Istio</a>. <a href="https://www.elastic.co/observability/application-performance-monitoring">Elastic’s application performance monitoring (APM)</a> also natively supports OTel, ensuring all APM capabilities are available with either Elastic or OTel agents or a combination of the two. In addition to the ECS contribution and ongoing collaboration with OTel SemConv, Elastic <a href="https://www.elastic.co/observability/opentelemetry">has continued to make contributions to other OTel projects</a>, including language SDKs (such as OTel Swift, OTel Go, OTel Ruby, and others), and participates in several <a href="https://github.com/open-telemetry/community#special-interest-groups">special interest groups (SIGs)</a> to establish OTel as a standard for observability and security.</p>
<p>We are excited about our <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">strengthening relationship with OTel</a> and the opportunity to donate our profiling agent in a way that benefits both the Elastic community and the broader OTel community. Learn more about <a href="https://www.elastic.co/observability/opentelemetry">Elastic’s OpenTelemetry support</a>, or contribute to the <a href="https://github.com/open-telemetry/community/issues/1918">donation proposal or just join the conversation</a>.</p>
<p>Stay tuned for further updates as the profiling part of OTel continues to evolve.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry/ecs-otel-announcement-1.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Elastic's contribution: Invokedynamic in the OpenTelemetry Java agent]]></title>
            <link>https://www.elastic.co/observability-labs/blog/invokedynamic-opentelemetry-java-agent</link>
            <guid isPermaLink="false">invokedynamic-opentelemetry-java-agent</guid>
            <pubDate>Thu, 19 Oct 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[The instrumentation approach in OpenTelemetry's Java agent comes with some limitations with respect to maintenance and testability. Elastic contributes an invokedynamic-based instrumentation approach that helps overcome these limitations.]]></description>
            <content:encoded><![CDATA[<p>As the second most active Cloud Native Computing Foundation (CNCF) project, <a href="https://opentelemetry.io/">OpenTelemetry</a> is well on its way to becoming the ubiquitous, unified standard and framework for observability. OpenTelemetry owes this success to its comprehensive and feature-rich toolset that allows users to retrieve valuable observability data from their applications with low effort. The OpenTelemetry Java agent is one of the most mature and feature-rich components in OpenTelemetry’s ecosystem. It provides automatic instrumentation for JVM-based applications and comes with broad coverage of auto-instrumentation modules for popular Java frameworks and libraries.</p>
<p>The original instrumentation approach used in the OpenTelemetry Java agent left the maintenance and development of auto-instrumentation modules subject to some restrictions. As part of <a href="https://www.elastic.co/blog/transforming-observability-ai-assistant-otel-standardization-continuous-profiling-log-analytics">our reinforced commitment to OpenTelemetry</a>, Elastic® helps evolve and improve OpenTelemetry projects and components. <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">Elastic’s contribution of the Elastic Common Schema</a> to OpenTelemetry was an important step for the open-source community. As another step in our commitment to OpenTelemetry, Elastic started contributing to the OpenTelemetry Java agent.</p>
<h2>Elastic’s invokedynamic-based instrumentation approach</h2>
<p>To overcome the above-mentioned limitations in developing and maintaining auto-instrumentation modules in the OpenTelemetry Java agent, Elastic started contributing its <a href="https://www.elastic.co/blog/embracing-invokedynamic-to-tame-class-loaders-in-java-agents"><strong>invokedynamic</strong>-based instrumentation approach</a> to the OpenTelemetry Java agent in July 2023.</p>
<p>To explain the improvement, you should know that in Java, a common approach to do auto-instrumentation of applications is through utilizing Java agents that do bytecode instrumentation at runtime. <a href="https://bytebuddy.net/#/">Byte Buddy</a> is a popular and widespread utility that helps with bytecode instrumentation without the need to deal with Java’s bytecode directly. Instrumentation logic that collects observability data from the target application’s code lives in so-called <em>advice methods</em>. Byte Buddy provides different ways of hooking these advice methods into the target application’s methods:</p>
<ul>
<li><em>Advice inlining:</em> The advice method’s code is being copied into the instrumented target method.</li>
<li><em>Static advice dispatching:</em> The instrumented target method invokes static advice methods that need to be visible by the instrumented code.</li>
<li><em>Advice dispatching with <strong>invokedynamic</strong>:</em> The instrumented target method uses the JVM’s <strong>invokedynamic</strong> bytecode instruction to call advice methods that are isolated from the instrumented code.</li>
</ul>
<p>These different approaches are described in great detail in our related blog post on <a href="https://www.elastic.co/blog/embracing-invokedynamic-to-tame-class-loaders-in-java-agents">Elastic’s Java APM agent using invokedynamic</a>. In a nutshell, both approaches, <em>advice inlining</em> and <em>dispatching to static advice methods</em>, come with some limitations with respect to writing and maintaining the advice code. So far, the OpenTelemetry Java agent has used <em>advice inlining</em> for its bytecode instrumentation. The resulting limitations on developing instrumentations are <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/v1.30.0/docs/contributing/writing-instrumentation-module.md#use-advice-classes-to-write-code-that-will-get-injected-to-the-instrumented-library-classes">documented in corresponding developer guidelines</a>. Among other things, not being able to debug advice code is a painful restriction when developing and maintaining instrumentation code.</p>
<p>Elastic’s APM Java agent has been using the <strong>invokedynamic</strong> approach with its benefits for years — field-proven by thousands of customers. To help improve the OpenTelemetry Java agent, Elastic started contributing the <strong>invokedynamic</strong> approach with the goal to simplify and improve the development and maintainability of auto-instrumentation modules. The contribution proposal and the implementation outline is documented in more detail in <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/8999">this GitHub issue</a>.</p>
<p>With the new approach in place, Elastic will help migrate existing instrumentations so the OTel Java community can benefit from the <strong>invokedynamic</strong>-based instrumentation approach.</p>
<blockquote>
<p>Elastic supports OTel natively, and has numerous capabilities to help you analyze your application with OTel. </p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Native OpenTelemetry support in Elastic Observability</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best Practices for instrumenting OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
</ul>
<p>Instrumenting with OpenTelemetry:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry (this is the application the team built to highlight <em>all</em> the languages below)</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual instrumentation </a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual instrumentation</a></li>
</ul>
</blockquote>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/invokedynamic-opentelemetry-java-agent/24-crystals.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Elastic contributes its Universal Profiling agent to OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-profiling-agent-acceptance-opentelemetry</link>
            <guid isPermaLink="false">elastic-profiling-agent-acceptance-opentelemetry</guid>
            <pubDate>Thu, 06 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic is advancing the adoption of OpenTelemetry with the contribution of its universal profiling agent. Elastic is committed to ensuring a vendor-agnostic ingestion and collection of observability and security telemetry through OpenTelemetry.]]></description>
            <content:encoded><![CDATA[<p>Following great collaboration between Elastic and OpenTelemetry's profiling community, which included a thorough review process, the OpenTelemetry community has accepted Elastic's donation of our continuous profiling agent. This marks a significant milestone in helping establish profiling as the fourth telemetry signal in OpenTelemetry. Elastic’s eBPF-based continuous profiling agent observes code across different programming languages and runtimes, third-party libraries, kernel operations, and system resources with low CPU and memory overhead in production. SREs can now benefit from these capabilities: quickly identifying performance bottlenecks, maximizing resource utilization, reducing carbon footprint, and optimizing cloud spend.
Over the past year, we have been instrumental in <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">enhancing OpenTelemetry's Semantic Conventions</a> with the donation of Elastic Common Schema (ECS), contributing to the OpenTelemetry Collector and language SDKs, and have been working with OpenTelemetry’s Profiling Special Interest Group (SIG) to lay the foundation necessary to make profiling stable.</p>
<p>With today’s acceptance, we are officially contributing our continuous profiler technology to OpenTelemetry. We will also dedicate a team of profiling domain experts to co-maintain and advance the profiling capabilities within OTel.</p>
<p>We want to thank the OpenTelemetry community for the great and constructive cooperation on the donation proposal. We look forward to jointly establishing continuous profiling as an integral part of OpenTelemetry.</p>
<h2>What is continuous profiling?</h2>
<p>Profiling is a technique used to understand the behavior of a software application by collecting information about its execution. This includes tracking the duration of function calls, memory usage, CPU usage, and other system resources.</p>
<p>However, traditional profiling solutions have significant drawbacks limiting adoption in production environments:</p>
<ul>
<li>Significant cost and performance overhead due to code instrumentation</li>
<li>Disruptive service restarts</li>
<li>Inability to get visibility into third-party libraries</li>
</ul>
<p>Unlike traditional profiling, which is often done only in a specific development phase or under controlled test conditions, continuous profiling runs in the background with minimal overhead. This provides real-time, actionable insights without replicating issues in separate environments. SREs, DevOps, and developers can see how code affects performance and cost, making code and infrastructure improvements easier.</p>
<h2>Contribution of production-grade features</h2>
<p>Elastic Universal Profiling is a whole-system, always-on, continuous profiling solution that eliminates the need for code instrumentation, recompilation, on-host debug symbols or service restarts. Leveraging eBPF, Elastic Universal Profiling profiles every line of code running on a machine, including application code, kernel, and third-party libraries. The solution measures code efficiency in three dimensions, CPU utilization, CO2, and cloud cost, to help organizations manage efficient services by minimizing computational waste.</p>
<p>The Elastic profiling agent facilitates identifying non-optimal code paths, uncovering &quot;unknown unknowns&quot;, and provides comprehensive visibility into the runtime behavior of all applications. Elastic’s continuous profiling agent supports various runtimes and languages, such as C/C++, Rust, Zig, Go, Java, Python, Ruby, PHP, Node.js, V8, Perl, and .NET.</p>
<p>Additionally, organizations can meet sustainability objectives by minimizing computational wastage, ensuring seamless alignment with their strategic <a href="https://en.wikipedia.org/wiki/Environmental,_social,_and_corporate_governance">ESG</a> goals.</p>
<h2>Benefits to OpenTelemetry</h2>
<p>This contribution not only boosts the standardization of continuous profiling for observability but also accelerates the practical adoption of profiling as the fourth key signal in OTel. Customers get a vendor-agnostic way of collecting profiling data and enabling correlation with existing signals, like tracing, metrics, and logs, opening <a href="https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation">new potential for observability insights and a more efficient troubleshooting experience</a>. </p>
<p>OTel-based continuous profiling unlocks the following possibilities for users:</p>
<ul>
<li>Improved customer experience: delivering consistent service quality and performance through continuous profiling ensures customers have an application that performs optimally, remains responsive, and is reliable.</li>
<li>Maximized gross margins: businesses can optimize their cloud spend and improve profitability by reducing the computational resources needed to run applications. Whole-system continuous profiling identifies the most expensive functions (down to the lines of code) across diverse environments that may span multiple cloud providers. In the cloud context, every CPU cycle saved translates to money saved.</li>
<li>Minimized environmental impact: energy consumption associated with computing is a growing concern (source: <a href="https://energy.mit.edu/news/energy-efficient-computing/">MIT Energy Initiative</a>). More efficient code translates to lower energy consumption, reducing carbon (CO2) footprint.</li>
<li>Accelerated engineering workflows: continuous profiling provides detailed insights to help troubleshoot complex issues faster, guide development, and improve overall code quality.</li>
<li>Improved vendor neutrality and increased efficiency: an OTel eBPF-based profiling agent removes the need to use proprietary APM agents and offers a more efficient way to collect profiling telemetry.</li>
</ul>
<p>With these benefits, customers can now manage the overall application’s efficiency on the cloud while ensuring their engineering teams optimize it.</p>
<h2>What comes next?</h2>
<p>While the acceptance of Elastic’s donation of the profiling agent marks a significant milestone in the evolution of OTel’s eBPF-based continuous profiling capabilities, it represents the beginning of a broader journey. Moving forward, we will continue collaborating closely with the OTel Profiling and Collector SIGs to ensure seamless integration of the profiling agent within the broader OTel ecosystem. During this phase, users can test early preview versions of the OTel profiling integration by following the directions in the <a href="https://github.com/elastic/otel-profiling-agent/">otel-profiling-agent</a> repository.</p>
<p>Elastic remains deeply committed to OTel’s vision of enabling cross-signal correlation. We plan to further contribute to the community by sharing our innovative research and implementations, specifically those facilitating the correlation between profiling data and distributed traces, across several OTel language SDKs and the profiling agent.</p>
<p>We are excited about our <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">growing relationship with OTel</a> and the opportunity to donate our profiling agent in a way that benefits both the Elastic community and the broader OTel community. Learn more about <a href="https://www.elastic.co/observability/opentelemetry">Elastic’s OpenTelemetry support</a> and learn how to contribute to the ongoing profiling work in the community.</p>
<h2>Additional Resources</h2>
<p>Additional details on Elastic’s Universal Profiling can be found in the <a href="https://www.elastic.co/observability-labs/blog/elastic-profiling-agent-acceptance-opentelemetry-faq">FAQ</a>.</p>
<p>For insights into observability, visit Elastic Observability Labs, where OTel-specific articles are also available.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-profiling-agent-acceptance-opentelemetry/profiling-acceptance.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Elastic's collaboration with OpenTelemetry on improving the filelog receiver]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastics-collaboration-opentelemetry-filelog-receiver</link>
            <guid isPermaLink="false">elastics-collaboration-opentelemetry-filelog-receiver</guid>
            <pubDate>Mon, 17 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic is committed to helping OpenTelemetry advance its logging capabilities. Learn about our collaboration with the OpenTelemetry community on improving the capabilities and quality aspects of the OpenTelemetry Collector's filelog receiver.]]></description>
            <content:encoded><![CDATA[<p>Logging, the newest generally available signal in OpenTelemetry (OTel), currently lags behind tracing and metrics in terms of feature scope and maturity.
At Elastic, we bring years of extensive experience with logging use cases and the challenges they present.
Committed to advancing OpenTelemetry's logging capabilities, we have focused on enhancing its logging functionalities.</p>
<p>Over the past few months, we have taken a close look at the capabilities of the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.102.0/receiver/filelogreceiver/README.md">filelog receiver</a>
in the <a href="https://opentelemetry.io/docs/collector/">OpenTelemetry Collector</a>, leveraging our expertise as the maintainers of <a href="https://www.elastic.co/beats/filebeat">Filebeat</a> to help refine and expand its potential.
Our goal is to contribute meaningfully to the evolution of OpenTelemetry's logging features, ensuring they meet the high standards required for robust observability.</p>
<p>Specifically, we focused on verifying that the receiver covers the cases and aspects that have been pain points for us in the past with Filebeat,
such as failover handling, self-telemetry, test coverage, documentation, and usability.
Based on our exploration, we started insightful conversations with the OTel project's maintainers, sharing our thoughts and any suggestions that could be useful from our experience.
Moreover, we've started putting up PRs to add documentation, make enhancements, improve tests, fix bugs, and even implement completely new features.</p>
<p>In this blog post we'll provide a sneak preview of the work that we've done so far in collaboration with the OpenTelemetry community and what's coming next as we continue to explore ways to improve the OpenTelemetry Collector for log collection.</p>
<h2>Enhancing the filelog receiver's telemetry</h2>
<p>Observability tools are software components like any other and thus need to be monitored like any other software, so that problems can be debugged and relevant settings tuned.
In particular, users of the filelog receiver will want to know how it's performing.
It's important that the filelog receiver emits sufficient telemetry data for common troubleshooting and optimization use cases.
This includes sufficient logging and observable metrics providing insights into the filelog receiver's internal state.</p>
<p>While the filelog receiver already provided a good set of self-telemetry data, we identified some areas of improvement.
In particular, we contributed functionality to emit self-telemetry <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/33237">logs on crucial events</a> like when log files are discovered, moved or truncated.
Another contribution includes <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/31544">observable metrics about filelog’s receiver internal state</a> about how many files are opened and being harvested.
You can find more information on the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/31256">respective tracking issue</a>.</p>
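<p>For illustration, the Collector's own logs and metrics can be made more verbose via the service-level telemetry settings. This is a minimal sketch, and exact option names may vary between Collector versions:</p>
<pre><code>service:
  telemetry:
    logs:
      level: debug      # surface per-file events (discovered, truncated, ...)
    metrics:
      level: detailed   # expose internal receiver metrics
</code></pre>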
<h2>Improving the Kubernetes container logs parsing</h2>
<p>The filelog receiver has been able to parse Kubernetes container logs for some time now.
However, properly parsing logs from Kubernetes Pods required a fair bit of configuration to deal with different runtime formats and to extract important meta information, such as <code>k8s.pod.name</code>, <code>k8s.container.name</code>, etc.
With this in mind, we proposed abstracting this complex set of configuration options into a simpler, implementation-specific container parser and contributed this new feature to the filelog receiver.
With the new feature in place, setting up log collection for Kubernetes is orders of magnitude easier: only eight lines of configuration instead of roughly 80 before.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastics-collaboration-opentelemetry-filelog-receiver/container-parser-config-example.png" alt="1 - Usability improvement for parsing Kubernetes container logs" /></p>
<p>You can learn more about the details of the new <a href="https://opentelemetry.io/blog/2024/otel-collector-container-log-parser">container logs parser in the corresponding OpenTelemetry blog post</a>.</p>
<h3>Evaluating test coverage</h3>
<p>Logs collection from files can run into different unexpected scenarios such as restarts, overload and error scenarios.
To ensure reliable and consistent collection of logs, it's important to ensure tests cover these kind of scenarios.
Based on our experience with testing Filebeat, we evaluated the existing filelog receiver tests with respect to those scenarios.
While most of the use cases and scenarios were well tested already, we identified a few scenarios where tests could be improved to ensure reliable log collection.
At the time of writing, we were working on contributing additional tests to address the identified test coverage gaps.
You can learn more about this in <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/32001">this GitHub issue</a>.</p>
<h3>Persistence evaluation</h3>
<p>Another important aspect of log collection that we often hear about from Elastic's logging users is failover handling and the delivery guarantees for logs.
Some logging use cases, for example audit logging, have strict delivery guarantee requirements.
Hence, it's important that the filelog receiver provides functionality to reliably handle situations, such as temporary unavailability of the logging backend or unexpected restarts of the OTel Collector.</p>
<p>Overall, the filelog receiver already has corresponding functionality to deal with such situations.
However, user documentation on how to setup reliable logs collection with tangible examples was an area with potential for improvement.</p>
<p>In this regard, beyond verifying the persistence and offset-tracking capabilities, we worked on improving the respective documentation
(<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/31886">1</a>, <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/30914">2</a>)
and are also collaborating on a <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/31074">community-reported issue</a> to ensure delivery guarantees for logs.</p>
<h3>Helping users help themselves</h3>
<p>Elastic has a long and varied history of supporting customers who use our products for log ingestion.
Drawing from this experience, we've proposed a couple of documentation improvements to the OpenTelemetry Collector to help logging users get out of some tricky situations.</p>
<p><strong>Documenting the structure of the tracking file</strong></p>
<p>For every log file the filelog receiver ingests, it needs to track how far into the file it has already read, so it knows where to start reading from when new contents are added to the file.
By default, the filelog receiver doesn't persist this tracking information to disk, but it can be configured to do so.
We felt it would be useful to <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/32180">document the structure of this tracking file</a>. When ingestion stops unexpectedly,
peeking into this tracking file can often provide clues as to where the problem may lie.</p>
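<p>As a sketch of how this persistence is typically enabled (the storage directory here is only an example), the receiver references a storage extension:</p>
<pre><code>extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage  # example path; must be writable

receivers:
  filelog:
    include:
      - /var/log/app/*.log
    storage: file_storage  # persist read offsets across Collector restarts

service:
  extensions: [file_storage]
</code></pre>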
<p><strong>Challenges with symlink target changes</strong></p>
<p>The filelog receiver periodically refreshes its memory of the files it's supposed to be ingesting.
The interval at which these refreshes happen is controlled by the <code>poll_interval</code> setting.
In certain setups log files being ingested by the filelog receiver are symlinks pointing to actual files.
Moreover, these symlinks can be updated to point to newer files over time.
If the symlink target changes twice before the filelog receiver has had a chance to refresh its memory, it will miss the first change and therefore not ingest the corresponding target file.
We've <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/32217">documented this edge case</a>, suggesting the users with such setups should make sure they set <code>poll_interval</code> to a sufficiently low value.</p>
<h3>Planning ahead for the receiver's GA </h3>
<p>Last but not least, we have raised the topic of making the filelog receiver a generally available (GA) component.
For users, it's important to be able to rely on the stability of the functionality they use, without facing the risk of breaking changes through minor version updates.
In this regard, we have kicked off a first plan with the maintainers to mark any issue that blocks the filelog receiver's stability with a <code>required_for_ga</code>
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues?q=is%3Aopen+is%3Aissue+label%3Arelease%3Arequired-for-ga+label%3Areceiver/filelog">label</a>.
Once the OpenTelemetry Collector reaches version <code>v1.0.0</code>, we will be able to work towards this specific receiver's GA as well.</p>
<h2>Conclusion</h2>
<p>Overall, OTel's filelog receiver component is in good shape and provides important functionality for most log collection use cases.
Where there are still minor gaps or room for improvement, we are glad to contribute our expertise and experience from Filebeat use cases.
The above is just the beginning of our effort to help the OpenTelemetry Collector, and its log collection in particular, get closer to a stable version.
Moreover, we are happy to help the filelog receiver maintainers with the general maintenance of the component: dealing with community issues and PRs, jointly working on the component's roadmap, and more.</p>
<p>We'd like to thank the OTel Collector group and, in particular, <a href="https://github.com/djaglowski">Daniel Jaglowski</a> for the great and constructive collaboration on the filelog receiver, so far!</p>
<p>Stay tuned to <a href="https://www.elastic.co/observability/opentelemetry">learn more about our future contributions and involvement in OpenTelemetry</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastics-collaboration-opentelemetry-filelog-receiver/otel-filelog-receiver.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Bridging the Gap: End-to-End Observability from Cloud Native to Mainframe]]></title>
            <link>https://www.elastic.co/observability-labs/blog/end-to-end-o11y-from-cloud-native-to-mainframe</link>
            <guid isPermaLink="false">end-to-end-o11y-from-cloud-native-to-mainframe</guid>
            <pubDate>Sun, 01 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Achieving end-to-end observability in hybrid enterprise environments, where modern cloud-native applications interact with critical, yet often opaque, IBM mainframe systems, is a challenge. Combining IBM Z Observability Connect, which enables OTel output, with Elastic Observability provides a solution, transforming your mainframe black box into a fully observable component of your deployment.]]></description>
            <content:encoded><![CDATA[<h2>Introduction</h2>
<p>OpenTelemetry is emerging as the standard for modern observability. As a highly active project within the Cloud Native Computing Foundation (CNCF)—second only to Kubernetes—it has become the monitoring solution of choice for cloud-native applications. OpenTelemetry provides a unified method for collecting traces, metrics, and logs across Kubernetes, microservices, and infrastructure.</p>
<p>However, for many enterprises—especially in banking, insurance, healthcare, and government—the reality is more complex than just “cloud native.” Although most organizations have deployed mobile apps and adopted microservices architectures, much of their critical core processing still relies on IBM mainframe applications. These systems process credit card swipes, financial transactions, patient records, and premium calculations.</p>
<p>This creates a dilemma: while the modern distributed systems of the hybrid environment are well-observed, the critical backend remains a black box.</p>
<h2>The “Broken Trace”</h2>
<p>A common challenge we see with customers involves a request that originates from a modern mobile application. The request hits microservices running on Kubernetes, initiates a service call to the mainframe, and suddenly, visibility stops.</p>
<p>When latency spikes or a transaction fails, Site Reliability Engineers (SREs) are left guessing. Is it the network? The API gateway? Or underlying mainframe applications like CICS? Without a unified, end-to-end view of the services involved—from the frontend Node.js microservices to the backend CICS service—mean time to resolution (MTTR) becomes “mean time to innocence,” with teams simply proving it wasn't their microservice rather than fixing root causes.</p>
<p>We need a unified view where a trace flows seamlessly from a cloud-native frontend (like React) all the way into mainframe transactions.</p>
<h2>IBM Z Observability Connect</h2>
<p>With the recent release of <a href="https://www.ibm.com/docs/en/zapmc/7.1.0?topic=z-observability-connect-overview">Z Observability Connect</a>, IBM has introduced OpenTelemetry-native instrumentation into mainframe applications. This creates a bridge between modern cloud-native services and mainframe transactions.</p>
<p>This means the mainframe is no longer a special case; it acts just like any other microservice in a mesh. It functions as an OpenTelemetry data producer, emitting traces, metrics, and logs to OpenTelemetry-compliant backends like Elastic.</p>
<h2>The Architecture</h2>
<p>The architecture is straightforward:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/end-to-end-o11y-from-cloud-native-to-mainframe/architecture.png" alt="architecture" /></p>
<ul>
<li><strong>The Collector</strong>: <a href="https://www.ibm.com/docs/en/zapmc/7.1.0?topic=z-observability-connect-overview">IBM Z Observability Connect</a> runs on z/OS. It collects logs, metrics, and traces and converts them into the OTLP (OpenTelemetry Protocol) format.</li>
<li><strong>The Processor</strong>: The <a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">Elastic Cloud Managed OTLP Endpoint</a> acts as a gateway collector, providing fully hosted, scalable, and reliable native OTLP ingestion.</li>
<li><strong>The Consumer</strong>: <a href="https://www.elastic.co/docs/solutions/observability/apm">Elastic APM</a> enables OpenTelemetry-native application performance monitoring, making it easy to pinpoint and fix performance problems quickly.</li>
</ul>
<h2>Putting it all together in Kubernetes</h2>
<p>We deploy an OpenTelemetry Collector within our Kubernetes cluster. This collector acts as a specialized gateway. It is configured to receive OTLP traffic directly from IBM Z Observability Connect on the mainframe and forward it securely to our observability backend, Elastic APM, by using the <code>otlp/elastic</code> exporter.</p>
<p>Here is the configuration for the OpenTelemetry Collector. Note the <code>exporters</code> section, which handles the authentication and batched transmission to Elastic:</p>
<pre><code>exporters:
  # Exporter to print the first 5 items and then every 1000th
  debug:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 1000

  # Exporter to send data to the Elastic Cloud Managed OTLP endpoint
  otlp/elastic:
    endpoint: ${env:ELASTIC_OTLP_ENDPOINT}
    headers:
      Authorization: ApiKey ${env:ELASTIC_API_KEY}
    sending_queue:
      enabled: true
      sizer: bytes
      queue_size: 50000000 # 50MB uncompressed
      block_on_overflow: true
    batch:
      flush_timeout: 1s
      min_size: 1_000_000 # 1MB uncompressed
      max_size: 4_000_000 # 4MB uncompressed

service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/elastic, debug]
</code></pre>
<p><em>Note: We strongly recommend using environment variables for your endpoints and API keys to keep your manifest secure.</em></p>
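<p>One way to wire those variables up (a sketch only — the Secret name and key names here are hypothetical) is to store both values in a Kubernetes Secret and map them into the collector container’s environment:</p>
<pre><code class="language-yaml"># Container spec fragment for the collector Deployment/DaemonSet.
# The Secret &quot;elastic-otlp-credentials&quot; is assumed to hold both values.
env:
  - name: ELASTIC_OTLP_ENDPOINT
    valueFrom:
      secretKeyRef:
        name: elastic-otlp-credentials
        key: endpoint
  - name: ELASTIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: elastic-otlp-credentials
        key: api-key
</code></pre>
<p>The collector then resolves <code>${env:ELASTIC_OTLP_ENDPOINT}</code> and <code>${env:ELASTIC_API_KEY}</code> at startup, so no credentials appear in the manifest itself.</p>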
<h2>Why the OTel specification matters</h2>
<p><a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">Elastic’s managed OTLP endpoint</a> and observability solution is built with native OTel support and adheres to the OTel specification and semantic conventions. Once we wired everything up and the data started to flow, we noticed that some of the traces in Elastic APM were not being represented correctly.</p>
<p>Most observability solutions derive the so-called RED metrics (rate, error, and duration) for the most important spans in a trace—i.e., incoming and outgoing spans of each individual service. This allows for an efficient indication of a service’s performance without the need to comb through all of the tracing data to show something as simple as the latency of a service’s endpoint or the error rate on outgoing requests.</p>
<p>For an efficient calculation of such derived metrics for incoming spans on a service, the <a href="https://github.com/open-telemetry/opentelemetry-specification/blob/main/oteps/0182-otlp-remote-parent.md">OTel community</a> introduced the <code>SPAN_FLAGS_CONTEXT_HAS_IS_REMOTE_MASK</code> and <code>SPAN_FLAGS_CONTEXT_IS_REMOTE_MASK</code> flags on the span entities within the OTLP protocol. These flags provide an unambiguous indication of whether an individual span is an entry span and, thus, allow observability backends to efficiently calculate metrics for entry-level spans.</p>
<p>If these flags are set incorrectly for an entry span, the span cannot be recognized as an entry span, and metrics are not derived properly—leading to a broken experience. This is what we initially experienced with the ingested OTel data from the IBM mainframe instrumentation.</p>
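<p>To make the flag semantics concrete, here is a small sketch (not the actual backend implementation) of how an observability backend can evaluate these bit masks; the mask values are taken from the <code>SpanFlags</code> enum in the OTLP <code>trace.proto</code> definition:</p>
<pre><code class="language-python"># Mask values from the OTLP SpanFlags enum (trace.proto).
CONTEXT_HAS_IS_REMOTE_MASK = 0x00000100
CONTEXT_IS_REMOTE_MASK = 0x00000200

def is_entry_span(flags, has_parent):
    # A span is an entry span when it has no parent at all (trace root),
    # or when its parent span context arrived from a remote process.
    if not has_parent:
        return True
    has_is_remote = bool(flags &amp; CONTEXT_HAS_IS_REMOTE_MASK)
    is_remote = bool(flags &amp; CONTEXT_IS_REMOTE_MASK)
    return has_is_remote and is_remote

print(is_entry_span(0x300, True))   # parent context known and remote: True
print(is_entry_span(0x100, True))   # parent known to be local: False
</code></pre>
<p>Only when the “has is-remote” bit is set can the “is-remote” bit be trusted, which is why both masks exist; a producer that sets them incorrectly leaves its entry spans looking like in-process child spans.</p>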
<p>In a proprietary world, this might have been a dead end or a months-long troubleshooting exercise. However, since OpenTelemetry is an open standard, we were able to debug the issue rapidly and share our findings with IBM engineers, who quickly developed a fix.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/end-to-end-o11y-from-cloud-native-to-mainframe/service_map.png" alt="service_map" /></p>
<h2>Streamline observability</h2>
<p>We now have end-to-end visibility that spans from modern mobile or web applications deep into the IBM mainframe. This unlocks significant value:</p>
<ul>
<li><strong>Unified Service Maps</strong>: You can visually see the dependency between the cloud-native cart service and the backend inventory system on z/OS.</li>
<li><strong>Single Pane of Glass</strong>: SREs no longer need to switch between modern observability tools and separate mainframe monitoring tools to view service health.</li>
<li><strong>Operational Efficiency</strong>: By eliminating the “blind spot” in the trace, you reduce the time spent on coordinating between cloud and mainframe teams, making issue resolution faster.</li>
</ul>
<h2>Conclusion</h2>
<p>If you are running hybrid workloads, it is time to stop treating your mainframe as a black box. With IBM Z Observability Connect, the Elastic Managed OTLP Endpoint, and Elastic APM, your entire stack can finally speak a single language: OpenTelemetry.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/end-to-end-o11y-from-cloud-native-to-mainframe/end-to-end-o11y.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[How to easily add application monitoring in Kubernetes pods]]></title>
            <link>https://www.elastic.co/observability-labs/blog/application-monitoring-kubernetes-pods</link>
            <guid isPermaLink="false">application-monitoring-kubernetes-pods</guid>
            <pubDate>Wed, 17 Jan 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog walks through installing the Elastic APM K8s Attacher and shows how to configure your system for both common and non-standard deployments of Elastic APM agents.]]></description>
            <content:encoded><![CDATA[<p>The <a href="https://www.elastic.co/guide/en/apm/attacher/current/index.html">Elastic® APM K8s Attacher</a> allows auto-installation of Elastic APM application agents (e.g., the Elastic APM Java agent) into applications running in your Kubernetes clusters. The mechanism uses a <a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/">mutating webhook</a>, which is a standard Kubernetes component, but you don’t need to know all the details to use the Attacher. Essentially, you can install the Attacher, add one annotation to any Kubernetes deployment that has an application you want monitored, and that’s it!</p>
<p>In this blog, we’ll walk through a full example from scratch using a Java application. Apart from the Java code and using a JVM for the application, everything else works the same for the other languages supported by the Attacher.</p>
<h2>Prerequisites</h2>
<p>This walkthrough assumes that the following are already installed on the system: JDK 17, Docker, Kubernetes, and Helm.</p>
<h2>The example application</h2>
<p>While the application (shown below) is a Java application, it would be easily implemented in any language, as it is just a simple loop that every 2 seconds calls the method chain methodA-&gt;methodB-&gt;methodC-&gt;methodD, with methodC sleeping for 10 milliseconds and methodD sleeping for 200 milliseconds. The choice of application is just to be able to clearly display in the Elastic APM UI that the application is being monitored.</p>
<p>The Java application in full is shown here:</p>
<pre><code class="language-java">package test;

public class Testing implements Runnable {

  public static void main(String[] args) {
    new Thread(new Testing()).start();
  }

  public void run()
  {
    while(true) {
      try {Thread.sleep(2000);} catch (InterruptedException e) {}
      methodA();
    }
  }

  public void methodA() {methodB();}

  public void methodB() {methodC();}

  public void methodC() {
    System.out.println(&quot;methodC executed&quot;);
    try {Thread.sleep(10);} catch (InterruptedException e) {}
    methodD();
  }

  public void methodD() {
    System.out.println(&quot;methodD executed&quot;);
    try {Thread.sleep(200);} catch (InterruptedException e) {}
  }
}
</code></pre>
<p>We created a Docker image containing that simple Java application for you that can be pulled from the following Docker repository:</p>
<pre><code class="language-bash">docker.elastic.co/demos/apm/k8s-webhook-test
</code></pre>
<h2>Deploy the pod</h2>
<p>First we need a deployment config. We’ll call the config file webhook-test.yaml, and the contents are pretty minimal — just pull the image and run that as a pod &amp; container called webhook-test in the default namespace:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: webhook-test
  labels:
    app: webhook-test
spec:
  containers:
    - image: docker.elastic.co/demos/apm/k8s-webhook-test
      imagePullPolicy: Always
      name: webhook-test
</code></pre>
<p>This can be deployed normally using kubectl:</p>
<pre><code class="language-bash">kubectl apply -f webhook-test.yaml
</code></pre>
<p>The result is exactly as expected:</p>
<pre><code class="language-bash">$ kubectl get pods
NAME           READY   STATUS    RESTARTS   AGE
webhook-test   1/1     Running   0          10s

$ kubectl logs webhook-test
methodC executed
methodD executed
methodC executed
methodD executed
</code></pre>
<p>So far, this is just setting up a standard Kubernetes application with no APM monitoring. Now we get to the interesting bit: adding in auto-instrumentation.</p>
<h2>Install Elastic APM K8s Attacher</h2>
<p>The first step is to install the <a href="https://www.elastic.co/guide/en/apm/attacher/current/index.html">Elastic APM K8s Attacher</a>. This only needs to be done once for the cluster — once installed, it is always available. Before installation, we will define where the monitored data will go. As you will see later, we can decide or change this any time. For now, we’ll specify our own Elastic APM server, which is at <a href="https://myserver.somecloud:443">https://myserver.somecloud:443</a> — we also have a secret token for authorization to that Elastic APM server, which has value MY_SECRET_TOKEN. (If you want to set up a quick test Elastic APM server, you can do so at <a href="https://cloud.elastic.co/">https://cloud.elastic.co/</a>).</p>
<p>There are two additional environment variables set for the application that are not generally needed but will help when we see the resulting UI content toward the end of the walkthrough (when the agent is auto-installed, these two variables tell the agent what name to give this application in the UI and what method to trace). Now we just need to define the custom yaml file to hold these. On installation, the custom yaml will be merged into the yaml for the Attacher:</p>
<pre><code class="language-yaml">apm:
  secret_token: MY_SECRET_TOKEN
  namespaces:
    - default
webhookConfig:
  agents:
    java:
      environment:
        ELASTIC_APM_SERVER_URL: &quot;https://myserver.somecloud:443&quot;
        ELASTIC_APM_TRACE_METHODS: &quot;test.Testing#methodB&quot;
        ELASTIC_APM_SERVICE_NAME: &quot;webhook-test&quot;
</code></pre>
<p>That custom.yaml file is all we need to install the attacher (note we’ve only specified the default namespace for agent auto-installation for now — this can be easily changed, as you’ll see later). Next we’ll add the Elastic charts to helm — this only needs to be done once, then all Elastic charts are available to helm. This is the usual helm add repo command, specifically:</p>
<pre><code class="language-bash">helm repo add elastic https://helm.elastic.co
</code></pre>
<p>Now the Elastic charts are available for installation (helm search repo would show you all the available charts). We’re going to use “elastic-webhook” as the name to install into, resulting in the following installation command:</p>
<pre><code class="language-bash">helm install elastic-webhook elastic/apm-attacher --namespace=elastic-apm --create-namespace --values custom.yaml
</code></pre>
<p>And that’s it, we now have the Elastic APM K8s Attacher installed and set to send data to the APM server defined in the custom.yaml file! (You can confirm installation with a helm list -A if you need.)</p>
<h2>Auto-install the Java agent</h2>
<p>The Elastic APM K8s Attacher is installed, but it doesn’t auto-install the APM application agents into every pod — that could cause problems! Instead, the Attacher is deliberately limited to auto-installing agents into deployments that (a) are in the namespaces listed in the custom.yaml, and (b) carry the specific annotation “co.elastic.apm/attach.”</p>
<p>So for now, restarting the webhook-test pod we created above changes nothing, as the pod isn’t yet set to be monitored. What we need to do is add the annotation. Specifically, we add the annotation referencing the default agent configuration that was installed with the Attacher, called “java” for the Java agent (we’ll see later how that agent configuration can be altered — the default configuration installs the latest agent version and leaves everything else at that version’s defaults). Adding that annotation to the webhook-test yaml gives us the new yaml file contents (the additional config is labelled (1)):</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: webhook-test
  annotations: #(1)
    co.elastic.apm/attach: java #(1)
  labels:
    app: webhook-test
spec:
  containers:
    - image: docker.elastic.co/demos/apm/k8s-webhook-test
      imagePullPolicy: Always
      name: webhook-test
</code></pre>
<p>Applying this change gives us the application now monitored:</p>
<pre><code class="language-bash">$ kubectl delete -f webhook-test.yaml
pod &quot;webhook-test&quot; deleted
$ kubectl apply -f webhook-test.yaml
pod/webhook-test created
$ kubectl logs webhook-test
… StartupInfo - Starting Elastic APM 1.45.0 …
</code></pre>
<p>And since the agent is now feeding data to our APM server, we can now see it in the UI:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/application-monitoring-kubernetes-pods/webhook-test-k8s-blog.png" alt="webhook-test" /></p>
<p>Note that the agent identifies Testing.methodB method as a trace root because of the ELASTIC_APM_TRACE_METHODS environment variable set to test.Testing#methodB in the custom.yaml — this tells the agent to specifically trace that method. The time taken by that method will be available in the UI for each invocation, but we don’t see the sub-methods . . . currently. In the next section, we’ll see how easy it is to customize the Attacher, and in doing so we’ll see more detail about the method chain being executed in the application.</p>
<h2>Customizing the agents</h2>
<p>In your systems, you’ll likely have development, testing, and production environments. You’ll want to pin the agent to a specific version rather than always pulling the latest, turn on debug logging for some applications or instances, and set specific options to specific values. This sounds like a lot of effort, but the Attacher lets you enable these kinds of changes in a very simple way. In this section, we’ll add a configuration that specifies all these changes, and we’ll see just how easy it is to configure and enable it.</p>
<p>We start at the custom.yaml file we defined above. This is the file that gets merged into the Attacher. Adding a new configuration with all the items listed in the last paragraph is easy — though first we need to decide a name for our new configuration. We’ll call it “java-interesting” here. The new custom.yaml in full is (the first part is just the same as before, the new config is simply appended):</p>
<pre><code class="language-yaml">apm:
  secret_token: MY_SECRET_TOKEN
  namespaces:
    - default
webhookConfig:
  agents:
    java:
      environment:
        ELASTIC_APM_SERVER_URL: &quot;https://myserver.somecloud:443&quot;
        ELASTIC_APM_TRACE_METHODS: &quot;test.Testing#methodB&quot;
        ELASTIC_APM_SERVICE_NAME: &quot;webhook-test&quot;
    java-interesting:
      image: docker.elastic.co/observability/apm-agent-java:1.55.4
      artifact: &quot;/usr/agent/elastic-apm-agent.jar&quot;
      environment:
        ELASTIC_APM_SERVER_URL: &quot;https://myserver.somecloud:443&quot;
        ELASTIC_APM_TRACE_METHODS: &quot;test.Testing#methodB&quot;
        ELASTIC_APM_SERVICE_NAME: &quot;webhook-test&quot;
        ELASTIC_APM_ENVIRONMENT: &quot;testing&quot;
        ELASTIC_APM_LOG_LEVEL: &quot;debug&quot;
        ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED: &quot;true&quot;
        JAVA_TOOL_OPTIONS: &quot;-javaagent:/elastic/apm/agent/elastic-apm-agent.jar&quot;
</code></pre>
<p>Breaking the additional config down, we have:</p>
<ul>
<li>
<p>The name of the new config java-interesting</p>
</li>
<li>
<p>The APM Java agent image docker.elastic.co/observability/apm-agent-java</p>
<ul>
<li>With a specific version 1.55.4 instead of latest</li>
</ul>
</li>
<li>
<p>We need to specify the agent jar location (the attacher puts it here)</p>
<ul>
<li>artifact: &quot;/usr/agent/elastic-apm-agent.jar&quot;</li>
</ul>
</li>
<li>
<p>And then the environment variables</p>
</li>
<li>
<p>ELASTIC_APM_SERVER_URL as before</p>
</li>
<li>
<p>ELASTIC_APM_ENVIRONMENT set to testing, useful when looking in the UI</p>
</li>
<li>
<p>ELASTIC_APM_LOG_LEVEL set to debug for more detailed agent output</p>
</li>
<li>
<p>ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED turning this on (setting to true) will give us additional interesting information about the method chain being executed in the application</p>
</li>
<li>
<p>And lastly we need to set JAVA_TOOL_OPTIONS to enable starting the agent: &quot;-javaagent:/elastic/apm/agent/elastic-apm-agent.jar&quot; — this is fundamentally how the attacher auto-attaches the Java agent</p>
</li>
</ul>
<p>More configurations and details about configuration options are <a href="https://www.elastic.co/guide/en/apm/agent/java/current/configuration.html">here for the Java agent</a>, and <a href="https://www.elastic.co/guide/en/apm/agent/index.html">other language agents</a> are also available.</p>
<h2>The application traced with the new configuration</h2>
<p>And finally we just need to upgrade the attacher with the changed custom.yaml:</p>
<pre><code class="language-bash">helm upgrade elastic-webhook elastic/apm-attacher --namespace=elastic-apm --create-namespace --values custom.yaml
</code></pre>
<p>This is the same command as the original install, but now using upgrade. That’s it — add config to the custom.yaml and upgrade the attacher, and it’s done! Simple.</p>
<p>Of course we still need to use the new config on an app. In this case, we’ll edit the existing webhook-test.yaml file, replacing java with java-interesting, so the annotation line is now:</p>
<pre><code class="language-yaml">co.elastic.apm/attach: java-interesting
</code></pre>
<p>Applying the new pod config and restarting the pod, you can see the logs now hold debug output:</p>
<pre><code class="language-bash">$ kubectl delete -f webhook-test.yaml
pod &quot;webhook-test&quot; deleted
$ kubectl apply -f webhook-test.yaml
pod/webhook-test created
$ kubectl logs webhook-test
… StartupInfo - Starting Elastic APM 1.55.4 …
… DEBUG co.elastic.apm.agent. …
… DEBUG co.elastic.apm.agent. …
</code></pre>
<p>More interesting is the UI. Now that inferred spans is on, the full method chain is visible.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/application-monitoring-kubernetes-pods/trace-sample-k8s-blog.png" alt="trace sample" /></p>
<p>This gives the details for methodB (it takes 211 milliseconds because it calls methodC - 10ms - which calls methodD - 200ms). The times for methodC and methodD are inferred rather than traced; if you needed accurate times, you would instead add those methods to trace_methods and have them traced too.</p>
<h2>Note on the ECK operator</h2>
<p>The <a href="https://www.elastic.co/guide/en/cloud-on-k8s/master/k8s-overview.html">Elastic Cloud on Kubernetes operator</a> allows you to install and manage a number of other Elastic components on Kubernetes. At the time of publication of this blog, the <a href="https://www.elastic.co/guide/en/apm/attacher/current/index.html">Elastic APM K8s Attacher</a> is a separate component, and there is no conflict between these management mechanisms — they apply to different components and are independent of each other.</p>
<h2>Try it yourself!</h2>
<p>This walkthrough is easily repeated on your system, and you can make it more useful by replacing the example application with your own and the Docker registry with the one you use.</p>
<p><a href="https://www.elastic.co/observability/kubernetes-monitoring">Learn more about real-time monitoring with Kubernetes and Elastic Observability</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/application-monitoring-kubernetes-pods/139689_-_Blog_Header_Banner_V1.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Dynamic workload discovery on Kubernetes now supported with EDOT Collector]]></title>
            <link>https://www.elastic.co/observability-labs/blog/k8s-discovery-with-EDOT-collector</link>
            <guid isPermaLink="false">k8s-discovery-with-EDOT-collector</guid>
            <pubDate>Tue, 01 Apr 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover how Elastic's OpenTelemetry Collector leverages Kubernetes pod annotations providing dynamic workload discovery and improves automated metric and log collection for Kubernetes clusters.]]></description>
            <content:encoded><![CDATA[<p>At Elastic, Kubernetes is one of the most significant observability use cases we focus on.
We want to provide the best onboarding experience and lifecycle management based on real-world GitOps best practices.</p>
<p>OpenTelemetry recently <a href="https://opentelemetry.io/blog/2025/otel-collector-k8s-discovery/">published a blog</a> on how to do <code>Autodiscovery based on Kubernetes Pods' annotations</code> with the OpenTelemetry Collector.</p>
<p>In this blog post, we will talk about how to use this Kubernetes-related feature of the OpenTelemetry Collector,
which is already available with the Elastic Distribution of OpenTelemetry (EDOT) Collector.</p>
<p>In addition to this feature, at Elastic, we heavily invest in making OpenTelemetry the best, standardized ingest solution for Observability.
You might already have seen us focusing on:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">Semantic Conventions standardization</a></p>
</li>
<li>
<p>significant <a href="https://www.elastic.co/observability-labs/blog/elastics-collaboration-opentelemetry-filelog-receiver">log collection improvements</a></p>
</li>
<li>
<p>various other topics around <a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-go-applications-opentelemetry">instrumentation</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">profiling</a></p>
</li>
</ul>
<p>Let's walk through a hands-on journey using the EDOT Collector, covering various use cases you might encounter in the real world and highlighting the capabilities of this powerful feature.</p>
<h2>Configuring EDOT Collector</h2>
<p>The Collector’s configuration is not our main focus here: by the nature of this feature it is minimal,
letting workloads define how they should be monitored.</p>
<p>To illustrate the point, here is the Collector configuration snippet that enables the feature for both logs and metrics:</p>
<pre><code class="language-yaml">receivers:
    receiver_creator/metrics:
      watch_observers: [k8s_observer]
      discovery:
        enabled: true
      receivers:

    receiver_creator/logs:
      watch_observers: [k8s_observer]
      discovery:
        enabled: true
      receivers:
</code></pre>
<p>You can include the above in the EDOT’s Collector configuration, specifically the
<a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L339">receivers’ section</a>.</p>
<p>Since log collection in our examples will happen through the discovery feature, make sure that the static filelog receiver
<a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L348">configuration block</a> is removed
and its <a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L193"><code>preset</code></a>
is disabled (i.e. set to <code>false</code>) to avoid log duplication.</p>
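<p>As a sketch, disabling that preset in the kube-stack values file typically looks like the following (the collector name and exact nesting depend on the chart version, so check the linked <code>values.yaml</code>):</p>
<pre><code class="language-yaml">collectors:
  daemon:
    presets:
      logsCollection:
        enabled: false   # avoid duplicate log collection alongside discovery
</code></pre>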
<p>Make sure that the receiver creator is properly added in the pipelines for
<a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L471">logs</a>
(in addition to removing the <code>filelog</code> receiver completely)
and <a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L484">metrics</a>
respectively.</p>
<p>Ensure that <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.122.0/extension/observer/k8sobserver/README.md"><code>k8sobserver</code></a>
is enabled as part of the extensions:</p>
<pre><code class="language-yaml">extensions:
  k8s_observer:
    observe_nodes: true
    observe_services: true
    observe_ingresses: true

# ...

service:
  extensions: [k8s_observer]
</code></pre>
<p>Last but not least, ensure the log files' volume is mounted properly:</p>
<pre><code class="language-yaml">volumeMounts:
 - name: varlogpods
   mountPath: /var/log/pods
   readOnly: true

volumes:
  - name: varlogpods
    hostPath:
      path: /var/log/pods
</code></pre>
<p>Once the configuration is ready, follow the <a href="https://www.elastic.co/docs/reference/opentelemetry/quickstart/">Kubernetes quickstart guides on how to deploy the EDOT Collector</a>.
Make sure to replace the <code>values.yaml</code> file linked in the quickstart guide with the file that includes the above-described modifications.</p>
<h3>Collecting Metrics from Moving Targets Based on Their Annotations</h3>
<p>In this example, we have a Deployment with a Pod spec that consists of two different containers.
One container runs a Redis server, while the other runs an NGINX server. Consequently, we want to provide
different hints for each of these target containers.</p>
<p>The annotation-based discovery feature supports this, allowing us to specify metrics annotations
per exposed container port.</p>
<p>Here is how the complete spec file looks:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-conf
data:
  nginx.conf: |
    user  nginx;
    worker_processes  1;
    error_log  /dev/stderr warn;
    pid        /var/run/nginx.pid;
    events {
      worker_connections  1024;
    }
    http {
      include       /etc/nginx/mime.types;
      default_type  application/octet-stream;

      log_format  main  '$remote_addr - $remote_user [$time_local] &quot;$request&quot; '
                        '$status $body_bytes_sent &quot;$http_referer&quot; '
                        '&quot;$http_user_agent&quot; &quot;$http_x_forwarded_for&quot;';
      access_log  /dev/stdout main;
      server {
          listen 80;
          server_name localhost;

          location /nginx_status {
              stub_status on;
          }
      }
      include /etc/nginx/conf.d/*;
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-deployment
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        # redis container port hints
        io.opentelemetry.discovery.metrics.6379/enabled: &quot;true&quot;
        io.opentelemetry.discovery.metrics.6379/scraper: redis
        io.opentelemetry.discovery.metrics.6379/config: |
          collection_interval: &quot;20s&quot;
          timeout: &quot;10s&quot;

        # nginx container port hints
        io.opentelemetry.discovery.metrics.80/enabled: &quot;true&quot;
        io.opentelemetry.discovery.metrics.80/scraper: nginx
        io.opentelemetry.discovery.metrics.80/config: |
          endpoint: &quot;http://`endpoint`/nginx_status&quot;
          collection_interval: &quot;30s&quot;
          timeout: &quot;20s&quot;
    spec:
      volumes:
      - name: nginx-conf
        configMap:
          name: nginx-conf
          items:
            - key: nginx.conf
              path: nginx.conf
      containers:
        - name: webserver
          image: nginx:latest
          ports:
            - containerPort: 80
              name: webserver
          volumeMounts:
            - mountPath: /etc/nginx/nginx.conf
              readOnly: true
              subPath: nginx.conf
              name: nginx-conf
        - image: redis
          imagePullPolicy: IfNotPresent
          name: redis
          ports:
            - name: redis
              containerPort: 6379
              protocol: TCP
</code></pre>
<p>When this workload is deployed, the Collector will automatically discover it and identify the specific annotations.
After this, two different receivers will be started, each responsible for one of the target containers.</p>
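<p>Conceptually, the receiver creator expands the annotations above into something like the following static configuration, with each discovered endpoint’s pod IP substituted in (illustrative only; <code>POD_IP</code> is a placeholder):</p>
<pre><code class="language-yaml">receivers:
  redis:
    endpoint: POD_IP:6379
    collection_interval: 20s
    timeout: 10s
  nginx:
    endpoint: http://POD_IP:80/nginx_status
    collection_interval: 30s
    timeout: 20s
</code></pre>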
<h3>Collecting Logs from Multiple Target Containers</h3>
<p>The annotation-based discovery feature also supports log collection based on the provided annotations.
In the example below, we again have a Deployment with a Pod consisting of two different containers,
where we want to apply different log collection configurations.
We can specify annotations that are scoped to individual container names:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-logs-deployment
  labels:
    app: busybox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
      annotations:
        io.opentelemetry.discovery.logs.lazybox/enabled: &quot;true&quot;
        io.opentelemetry.discovery.logs.lazybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints-lazybox
        io.opentelemetry.discovery.logs.busybox/enabled: &quot;true&quot;
        io.opentelemetry.discovery.logs.busybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints-busybox
    spec:
      containers:
        - name: busybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo &quot;otel logs from busybox at $(date +%H:%M:%S)&quot; &amp;&amp; sleep 5s; done
        - name: lazybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo &quot;otel logs from lazybox at $(date +%H:%M:%S)&quot; &amp;&amp; sleep 25s; done
</code></pre>
<p>The above configuration enables two different filelog receiver instances, each applying a unique parsing configuration.
This is handy when we know how to parse the logs of a specific technology, such as Apache server access logs.</p>
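<p>As an illustration, for a hypothetical container named <code>apache</code>, the log annotation could carry a <code>regex_parser</code> operator tailored to the common access-log format. The container name and regex below are purely illustrative, and the regex only captures the leading fields:</p>
<pre><code class="language-yaml">annotations:
  io.opentelemetry.discovery.logs.apache/enabled: &quot;true&quot;
  io.opentelemetry.discovery.logs.apache/config: |
    operators:
      - id: container-parser
        type: container
      # Illustrative parser for the leading fields of an Apache access-log line
      - id: access-log-parser
        type: regex_parser
        regex: '^(?P&lt;client&gt;\S+) \S+ \S+ \[(?P&lt;time&gt;[^\]]+)\] &quot;(?P&lt;method&gt;\S+) (?P&lt;path&gt;\S+)[^&quot;]*&quot; (?P&lt;status&gt;\d+)'
</code></pre>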
<h3>Combining Both Metrics and Logs Collection</h3>
<p>In our third example, we illustrate how to define both metrics and log annotations on the same workload.
This allows us to collect both signals from the discovered workload.
Below is a Deployment with a Pod consisting of a Redis server and a BusyBox container that performs dummy log writing.
We can target annotations at the port and container level to collect metrics from the Redis server using
the Redis receiver, and logs from the BusyBox container using the filelog receiver. Here’s how:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-deployment
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        io.opentelemetry.discovery.metrics.6379/enabled: &quot;true&quot;
        io.opentelemetry.discovery.metrics.6379/scraper: redis
        io.opentelemetry.discovery.metrics.6379/config: |
          collection_interval: &quot;20s&quot;
          timeout: &quot;10s&quot;

        io.opentelemetry.discovery.logs.busybox/enabled: &quot;true&quot;
        io.opentelemetry.discovery.logs.busybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints
    spec:
      containers:
        - image: redis
          imagePullPolicy: IfNotPresent
          name: redis
          ports:
            - name: redis
              containerPort: 6379
              protocol: TCP
        - name: busybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo &quot;otel logs at $(date +%H:%M:%S)&quot; &amp;&amp; sleep 15s; done
</code></pre>
<h3>Explore and analyze data coming from dynamic targets in Elastic</h3>
<p>Once the target Pods are discovered and the Collector has started collecting telemetry data from them,
we can explore this data in Elastic. In Discover, we can search for Redis and NGINX metrics as well as
logs collected from the BusyBox container. Here is how it looks:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/k8s-discovery-with-EDOT-collector/discoverlogs.png" alt="Logs Discovery" />
<img src="https://www.elastic.co/observability-labs/assets/images/k8s-discovery-with-EDOT-collector/discovermetrics.png" alt="Metrics Discovery" /></p>
<h2>Summary</h2>
<p>The examples above showcase how users of our OpenTelemetry Collector can take advantage of this new feature
— one we played a major role in developing.</p>
<p>For this, we leveraged our years of experience with similar features already supported in
<a href="https://www.elastic.co/guide/en/beats/metricbeat/current/configuration-autodiscover-hints.html">Metricbeat</a>,
<a href="https://www.elastic.co/guide/en/beats/filebeat/current/configuration-autodiscover-hints.html">Filebeat</a>, and
<a href="https://www.elastic.co/guide/en/fleet/current/hints-annotations-autodiscovery.html">Elastic-Agent</a>.
We are proud of this work, as it closes the feature gap between Elastic's dedicated
monitoring agents and the OpenTelemetry Collector, making the Collector an even stronger choice.</p>
<p>Interested in learning more? Visit the
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/receivercreator/README.md#generate-receiver-configurations-from-provided-hints">documentation</a>
and give it a try by following our <a href="https://www.elastic.co/docs/reference/opentelemetry/quickstart/">EDOT quickstart guide</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/k8s-discovery-with-EDOT-collector/k8s-discovery-new.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Monitoring Android applications with Elastic APM]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitoring-android-applications-apm</link>
            <guid isPermaLink="false">monitoring-android-applications-apm</guid>
            <pubDate>Tue, 21 Mar 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic has launched its APM agent for Android applications, allowing developers to track key aspects of applications to help troubleshoot issues and performance flaws with mobile applications, corresponding backend services, and their interactions.]]></description>
            <content:encoded><![CDATA[<blockquote>
<p><strong>WARNING</strong>: This article shows information about the Android agent that is no longer accurate for versions <code>1.x</code>. Please refer to <a href="https://www.elastic.co/docs/reference/apm/agents/android">its documentation</a> to learn about its new APIs.</p>
</blockquote>
<p>People are handling more and more matters on their smartphones through mobile apps both privately and professionally. With thousands or even millions of users, ensuring great <a href="https://www.elastic.co/observability/application-performance-monitoring">application performance</a> and reliability is a key challenge for providers and operators of mobile apps and related backend services. Understanding the behavior of mobile apps, the occurrences and types of crashes, the <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">root causes of slow response times</a>, and the real user impact of backend issues is key to managing the performance of mobile apps and associated backend services.</p>
<p>Elastic has launched its application performance monitoring (<a href="https://www.elastic.co/observability/application-performance-monitoring">APM</a>) agent for Android applications, allowing developers to keep track of key aspects of their applications, from crashes and HTTP requests to screen rendering times and end-to-end distributed tracing. All of this helps troubleshoot issues and performance flaws with mobile applications, corresponding backend services, and their interaction. The Elastic APM Android Agent automatically instruments your application and its dependencies so that you can simply “plug-and-play” the agent into your application without having to worry about changing your codebase much.</p>
<p>The Elastic APM Android Agent has been developed from scratch on top of OpenTelemetry, an open standard and framework for observability. Developers will be able to take full advantage of its capabilities, as well as the support provided by a huge and active community. If you’re familiar with OpenTelemetry and your application is already instrumented with OpenTelemetry, then you can simply reuse it all when switching to the Elastic APM Android Agent. But no worries if that’s not the case — the agent is configured to handle common traceable scenarios automatically without having to deep dive into the specifics of the OpenTelemetry API.</p>
<p>[Related article: <a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a>]</p>
<h2>How it works</h2>
<p>The Elastic APM Android Agent is a combination of an SDK plus a Gradle plugin. The SDK contains utilities that will let you initialize and configure the agent’s behavior, as well as prepare and initialize the OpenTelemetry SDK. You can use the SDK for programmatic configuration and initialization of the agent, in particular for advanced and special use cases.</p>
<p>In most cases, a programmatic configuration and initialization won’t be necessary. Instead, you can use the provided Gradle plugin to configure the agent and automatically instrument your app. The Gradle plugin uses Byte Buddy and the official Android Gradle plugin API under the hood to automatically inject instrumentation code into your app through compile-time transformation of your application’s and its dependencies’ classes.</p>
<p>Compiling your app with the Elastic Android APM Agent Gradle Plugin configured and enabled will make your Android app report tracing data, metrics, and different events and logs at runtime.</p>
<h2>Using the Elastic APM Agent in an Android app</h2>
<p>Using a <a href="https://github.com/elastic/sample-app-android-apm">simple demo application</a>, we’ll walk through the steps in the “<a href="https://www.elastic.co/guide/en/apm/agent/android/current/setup.html">Set up the Agent</a>” guide to set up the Elastic Android APM Agent.</p>
<h3>Prerequisites</h3>
<p>For this example, you will need the following:</p>
<ul>
<li>An Elastic Stack with APM enabled (We recommend using Elastic’s Cloud offering. <a href="https://www.elastic.co/cloud/elasticsearch-service/signup?baymax=docs-body&amp;elektra=docs">Try it for free</a>.)</li>
<li>Java 11+</li>
<li><a href="https://developer.android.com/studio?gclid=Cj0KCQiAic6eBhCoARIsANlox87QsDnyjpKObQSivZz6DHMLTiL76CmqZGXTEqf4L7h3jQO7ljm8B14aAo4xEALw_wcB&amp;gclsrc=aw.ds">Android Studio</a></li>
<li><a href="https://developer.android.com/studio/run/emulator">Android Emulator, AVD device</a></li>
</ul>
<p>You’ll also need a way to push the app’s <a href="https://opentelemetry.io/docs/concepts/signals/">signals</a> into Elastic. Therefore, you will need Elastic APM’s <a href="https://www.elastic.co/guide/en/apm/guide/current/secret-token.html#create-secret-token">secret token</a> that you’ll configure into our sample app later.</p>
<h3>Test project for our example</h3>
<p>To showcase an end-to-end scenario including distributed tracing, in this example, we’ll instrument a <a href="https://github.com/elastic/sample-app-android-apm">simple weather application</a> that comprises two Android UI fragments and a simple local backend service based on Spring Boot.</p>
<p>The first fragment will have a dropdown list with some city names and also a button that takes you to the second one, where you’ll see the selected city’s current temperature. If you pick a non-European city on the first screen, you’ll get an error from the (local) backend when you head to the second screen. This is to demonstrate how network and backend errors are captured and correlated in Elastic APM.</p>
<h3>Applying the Elastic APM Agent plugin</h3>
<p>In the following, we will explain <a href="https://www.elastic.co/guide/en/apm/agent/android/current/setup.html">all the steps required to set up the Elastic APM Android Agent</a> from scratch for an Android application. In case you want to skip these instructions and see the agent in action right away, use the main branch of that repo and apply only Step (3.b) before continuing with the next Section (“Setting up the local backend service”).</p>
<ol>
<li>Clone the <a href="https://github.com/elastic/sample-app-android-apm">sample app</a> repo and open it in Android Studio.</li>
<li>Switch to the uninstrumented repo branch to start from a blank, uninstrumented Android application. You can run this command to switch to the uninstrumented branch:</li>
</ol>
<pre><code class="language-bash">git checkout uninstrumented
</code></pre>
<ol start="3">
<li>Follow the Elastic APM Android Agent’s <a href="https://www.elastic.co/guide/en/apm/agent/android/current/setup.html">setup guide</a>:</li>
</ol>
<p>Add the co.elastic.apm.android plugin to the app/build.gradle file (please make sure to use the latest version available of the plugin, which you can find <a href="https://plugins.gradle.org/plugin/co.elastic.apm.android">here</a>).</p>
<p>Configure the agent’s connection to the Elastic APM backend by providing the ‘serverUrl’ and ‘secretToken’ in the ‘elasticAPM’ section of the app/build.gradle file.</p>
<pre><code class="language-java">// Android app's build.gradle file
plugins {
    //...
    id &quot;co.elastic.apm.android&quot; version &quot;[latest_version]&quot;
}

//...

elasticApm {
    // Minimal configuration
    serverUrl = &quot;https://your.elastic.apm.endpoint&quot;

    // Optional
    serviceName = &quot;weather-sample-app&quot;
    serviceVersion = &quot;0.0.1&quot;
    secretToken = &quot;your Elastic APM secret token&quot;
}
</code></pre>
<ol start="4">
<li>The only actual code change required is a one-liner to initialize the Elastic APM Android Agent in the Application.onCreate method. The application class for this sample app is located at app/src/main/java/co/elastic/apm/android/sample/MyApp.kt.</li>
</ol>
<pre><code class="language-kotlin">
package co.elastic.apm.android.sample

import android.app.Application
import co.elastic.apm.android.sdk.ElasticApmAgent

class MyApp : Application() {

    override fun onCreate() {
        super.onCreate()
        ElasticApmAgent.initialize(this)
    }
}
</code></pre>
<p>Bear in mind that for this example, we’re not changing the agent’s default configuration — if you want more information about how to do so, take a look at the agent’s <a href="https://www.elastic.co/guide/en/apm/agent/android/current/configuration.html#_runtime_configuration">runtime configuration guide</a>.</p>
<p>Before launching our Android Weather App, we need to configure and start the local weather-backend service as described in the next section.</p>
<h3>Setting up the local backend service</h3>
<p>One of the key features the agent provides is distributed tracing, which allows you to see the full end-to-end story of an HTTP transaction, starting from our mobile app and traversing instrumented backend services used by the app. Elastic APM will show you the full picture as one distributed trace, which comes in very handy for troubleshooting issues, especially the ones related to high latency and backend errors.</p>
<p>As part of our sample app, we’re going to launch a simple local backend service that will handle our app’s HTTP requests. The backend service is instrumented with the <a href="https://www.elastic.co/guide/en/apm/agent/java/current/index.html">Elastic APM Java agent</a> to collect and send its own APM data over to Elastic APM, allowing it to correlate the mobile interactions with the processing of the backend requests.</p>
<p>In order to configure the local server, we need to set our Elastic APM endpoint and secret token (the same used for our Android app in the previous step) into the backend/src/main/resources/elasticapm.properties file:</p>
<pre><code class="language-bash">service_name=weather-backend
application_packages=co.elastic.apm.android.sample
server_url=YOUR_ELASTIC_APM_URL
secret_token=YOUR_ELASTIC_APM_SECRET_TOKEN
</code></pre>
<h3>Launching the demo</h3>
<p>Our sample app will get automatic instrumentation for the agent’s currently <a href="https://www.elastic.co/guide/en/apm/agent/android/current/supported-technologies.html">supported frameworks</a>, which means that we’ll get to see screen rendering spans as well as OkHttp requests out of the box. For frameworks not currently supported, you could apply manual instrumentation to enrich your APM data (see “Manual Instrumentation” below).</p>
<p>We are ready to launch the demo. (The demo is meant to be executed in a local environment using an Android emulator.) To do so, we need to:</p>
<ol>
<li>Launch the backend service using this command in a terminal located in the root directory of our sample project: ./gradlew bootRun (or gradlew.bat bootRun if you’re on Windows). Alternatively, you can start the backend service from Android Studio.</li>
<li>Launch the weather sample app in an Android emulator (from Android Studio).</li>
</ol>
<p>Once everything is running, we need to navigate around in the app to generate some load that we can then observe in Elastic APM. So, select a city, click <strong>Next</strong>, and repeat this multiple times. Please also make sure to select <strong>New York</strong> at least once. You will see that the weather forecast won’t work for New York. Below, we will use Elastic APM to find out what’s going wrong when selecting New York.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/blog-elastic-android-apm-city-selection.png" alt="apm android city selection" /></p>
<h2>First glance at the APM results</h2>
<p>Let’s open Kibana and navigate to the Observability solution.</p>
<p>Under the Services navigation item, you should see a list of two services: our Android app <strong>weather-sample-app</strong> and the corresponding backend service <strong>weather-backend</strong>. Click on the <strong>Service map</strong> tab to see a visualization of the dependencies between those services and any external services.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/blog-elastic-apm-android-services.png" alt="apm android services" /></p>
<p>Click on the <strong>weather-sample-app</strong> to dive into the dashboard for the Android app. The service view for mobile applications is in technical preview as of the publishing of this blog post, but you can already see insightful information about the app on that screen. You see information like the number of active sessions in the selected time frame, the number of HTTP requests emitted by the weather-sample-app, the geographical distribution of the requests, as well as breakdowns by device model, OS version, network connection type, and app version. (Information on crashes and app load times is under development.)</p>
<p>For the purpose of demonstration, we kept this demo simple, so the data is less diversified and also rather limited. However, this kind of data is particularly useful when you are monitoring a mobile app with higher usage numbers and more diversity in device models, OS versions, etc. Troubleshooting problems and performance issues becomes much easier when you can use these properties to filter and group your APM data. You can use the quick filters at the top to do so and see how the metrics adapt depending on your selection.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/blog-elastic-apm-android-weather-sample-app.png" alt="apm android weather sample app" /></p>
<p>Now, let’s see how individual user interactions are processed, including downstream calls into the backend service. Under the Transactions tab (at the top), we see the different end-to-end transaction groups, including the two transactions for the FirstFragment and the SecondFragment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/blog-elastic-apm-android-latency-distribution.png" alt="apm android latency distribution" /></p>
<p>Let’s deep dive into the SecondFragment - View appearing transaction, to see the metrics (e.g., latency, throughput) for this transaction group and also the invocation waterfall view for the individual user interactions. As we can see in the following screenshot, after view creation, the fragment performs an HTTP GET request to 10.0.2.2, which takes ~130 milliseconds. In the same waterfall, we see that the HTTP call is processed by the weather-backend service, which itself conducts an HTTP call to api.open-meteo.com.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/blog-elastic-apm-android-trace-samples.png" alt="apm android trace samples" /></p>
<p>Now, when looking at the waterfall view for a request where New York was selected as the city, we see an error happening on the backend service that explains why the forecast didn’t work for New York. By clicking on the red <strong>View related error</strong> badge, you will get details on the error and the actual root cause of the problem.</p>
<p>The exception message on the weather-backend states that “This service can only retrieve geo locations for European cities!” That’s the problem with selecting New York as the city.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/blog-elastic-apm-android-weather-backend.png" alt="apm android weather backend" /></p>
<h2>Manual instrumentation</h2>
<p>As previously mentioned, the Elastic APM Android Agent does a bunch of automatic instrumentation on your behalf for the <a href="https://www.elastic.co/guide/en/apm/agent/android/current/supported-technologies.html">supported frameworks</a>; however, in some cases, you might want to add extra instrumentation depending on your app’s use cases. For those cases, you’re covered by the OpenTelemetry API, which the Elastic APM Android Agent is built on. The OpenTelemetry Java SDK provides tools to create custom spans, metrics, and logs. Since the SDK is the base of the Elastic APM Android Agent, it’s available for you to use without adding any extra dependencies to your project, and without any extra configuration to connect your custom signals to your Elastic environment, as the agent does that for you.</p>
<p>The way to start would be by getting OpenTelemetry’s instance like so:</p>
<pre><code class="language-java">OpenTelemetry openTelemetry = GlobalOpenTelemetry.get();
</code></pre>
<p>And then you can follow the instructions from the <a href="https://opentelemetry.io/docs/instrumentation/java/manual/#acquiring-a-tracer">OpenTelemetry Java documentation</a> in order to create your custom signals. See the following example for the creation of a custom span:</p>
<pre><code class="language-java">OpenTelemetry openTelemetry = GlobalOpenTelemetry.get();
Tracer tracer = openTelemetry.getTracer(&quot;instrumentation-library-name&quot;, &quot;1.0.0&quot;);
Span span = tracer.spanBuilder(&quot;my span&quot;).startSpan();

// Make the span the current span
try (Scope ss = span.makeCurrent()) {
  // In this scope, the span is the current/active span
} finally {
    span.end();
}
</code></pre>
<h2>Conclusion</h2>
<p>In this blog post, we demonstrated how you can use the Elastic APM Android Agent to achieve end-to-end observability in your Android-based mobile applications. Setting up the agent is a matter of a few minutes, and the provided insights allow you to analyze your app’s performance and its dependencies on backend services. With the Elastic APM Android Agent in place, you can leverage Elastic’s rich APM features as well as the various possibilities to customize your analysis workflows through custom instrumentation and custom dashboards.</p>
<p>Are you curious? Then try it yourself. Sign up for a <a href="https://www.elastic.co/cloud/elasticsearch-service/signup">free trial on the Elastic Cloud</a>, enrich your Android app with the Elastic APM Android agent as described in this blog, and explore the data in <a href="https://www.elastic.co/observability">Elastic’s Observability solution</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/illustration-indusrty-technology-social-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[OpenTelemetry Data Quality Insights with the Instrumentation Score and Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/otel-instrumentation-score</link>
            <guid isPermaLink="false">otel-instrumentation-score</guid>
            <pubDate>Thu, 06 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This post explores the Instrumentation Score for OpenTelemetry data quality, sharing practical insights, key learnings, and a hands-on look at implementing this approach with Elastic's powerful observability features.]]></description>
            <content:encoded><![CDATA[<p>OpenTelemetry adoption is rapidly increasing and more companies rely on OpenTelemetry to collect observability data.
While OpenTelemetry offers clear specifications and semantic conventions to guide telemetry data collection, it also introduces significant flexibility.
With high flexibility comes high responsibility — many things can go wrong with OTel-based data collection, easily resulting in mediocre or low-quality telemetry.
Poor data quality can hinder backend analysis, confuse users, and degrade system performance.
To unlock actionable insights from OpenTelemetry data, maintaining high data quality is essential.
The <a href="https://instrumentation-score.com/">Instrumentation Score</a> initiative addresses this challenge by providing a standardized way to measure OpenTelemetry data quality.
Although the specification and tooling are still evolving, the underlying concepts are already compelling.
In this blog post, I’ll share my experience experimenting with the Instrumentation Score concept and demonstrate how to use the Elastic Stack — utilizing ES|QL, Kibana Task Manager, and Dashboards — to build a POC for data quality analysis based on this approach within Elastic Observability.</p>
<h2>Instrumentation Score - The Power of Rule-based Data Quality Analysis</h2>
<p>When you first hear the term &quot;Instrumentation Score&quot;, your initial reaction might be: &quot;OK, there's a <em>single</em>, percentage-like metric that tells me my instrumentation (i.e. OTel data) has a score of 60 out of 100.
So what? How does it help me?&quot;</p>
<p>However, the Instrumentation Score is much more than just a single number.
Its power lies in the individual rules from which the score is calculated.
The rule definitions' <code>rationale</code>, <code>impact level</code>, and <code>criteria</code> provide an evaluation framework that enables you to drill down into data quality issues and identify specific areas for improvement.
Also, the Instrumentation Score specification does not mandate specific tools and implementation details for calculating the score and rule evaluations.</p>
<p>As I explored the Instrumentation Score concepts, I developed the following mental model for deriving actionable insights.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/otel-instrumentation-score/inst-score-drill-down.png" alt="Instrumentation Score Drill-down" /></p>
<h5>The Score</h5>
<p>The score itself is an indicator of the quality of your telemetry data. The lower the number, the more room for improvement with your data quality.
In general, if a score falls below 75, you should consider fixing your instrumentation and data collection.</p>
<h5>Breakdown by Instrumentation Score Rules</h5>
<p>Exploring the evaluation results of individual Instrumentation Score <em>rules</em> will give you insights into <em>what</em> is wrong with your data quality.
In addition, the rules' rationales explain <em>why</em> the violation of a rule is problematic.</p>
<p>As an example, let's take the <a href="https://github.com/instrumentation-score/spec/blob/main/rules/SPA-002.md"><code>SPA-002 rule</code></a>:</p>
<blockquote>
<p><strong>Description</strong>:</p>
<p>Traces do not contain orphan spans.</p>
<p><strong>Rationale</strong>:</p>
<p>Orphaned spans indicate potential issues in tracing instrumentation or data integrity. This can lead to incomplete or misleading trace data, hindering effective troubleshooting and performance analysis.</p>
</blockquote>
<p>If your data violates the <code>SPA-002</code> rule, you know <em>what</em> is wrong (i.e. you have broken traces), and the rationale explains why that is an issue (i.e. degraded analysis capabilities).</p>
<h5>Breakdown by Services</h5>
<p>When you have a large system with hundreds or maybe even thousands of entities (such as services, Kubernetes pods, etc.), a binary signal on all of the data — such as &quot;has a certain rule been passed or not&quot; — is not really actionable.
Is the data from all services violating a certain rule, or just a small subset of services?</p>
<p>Breaking down rule evaluation by services (and potentially other entity types) may help you to identify <em>where</em> there are issues with data quality.
For example, let's assume only one service — the <code>cart-service</code> — (out of your fifty services) is affected by the violation of rule <code>SPA-002</code>.
With that information, you can focus on fixing the instrumentation for the <code>cart-service</code> instead of having to check all fifty services.</p>
<p>Once you know which services (or other entities) violate which Instrumentation Score rules, you're very close to actionable insights.
However, there are two more things that I found to be extremely useful for data quality analysis when I was experimenting with the Instrumentation Score evaluation: (1) a quantitative indication of the extent, and (2) concrete examples of rule violation occurrences in your data.</p>
<h5>Quantifying the Rule Violation Extent</h5>
<p>The Instrumentation Score spec already defines an impact level (e.g. <code>NORMAL</code>, <code>IMPORTANT</code>, <code>CRITICAL</code>) per rule.
However, this only covers the &quot;importance&quot; of the rule itself, not the extent of a rule violation.
For example, if a single trace (out of a million traces) on your service has an orphan span, technically speaking the rule <code>SPA-002</code> is violated.
But is it really a relevant issue if only one out of a million traces is affected? Probably not. It definitely would be if half of your traces were broken.</p>
<p>Hence, having a quantitative indication of the extent of a rule violation per service — e.g. &quot;40% of your traces violate <code>SPA-002</code>&quot; — would provide additional information on how severe a rule violation actually is.</p>
<h5>Tangible Examples</h5>
<p>Finally, nothing is as meaningful and self-explanatory as tangible, concrete examples from your own data.
If the telemetry data of your <code>cart-service</code> violates <code>SPA-002</code> (i.e., has traces with orphan spans), wouldn't you want to see a concrete trace from that service that demonstrates the rule violation?
Analyzing concrete examples may give you hints about the root cause of broken traces — or, more generally, why your data violates Instrumentation Score rules.</p>
<h2>Instrumentation Score with Elastic</h2>
<p>The Instrumentation Score spec does not prescribe tool usage or implementation details for the calculation of the score and evaluation of the rules.
This allows for integrating the Instrumentation Score concept with whatever backend your OpenTelemetry data is being sent to.</p>
<p>With the goal of building a POC for an end-to-end integration of the Instrumentation Score with Elastic Observability, I combined the powerful capabilities of ES|QL with Kibana's task manager and dashboarding features.</p>
<p>Each Instrumentation Score rule can be formulated as an ES|QL query that covers the steps described above:</p>
<ul>
<li>rule passed or not</li>
<li>breakdown by services</li>
<li>calculation of the extent</li>
<li>sampling of an example occurrence</li>
</ul>
<p>Here is an example query for the <code>LOG-002</code> rule that checks the validity of the <code>severity_number</code> field:</p>
<pre><code class="language-esql">FROM logs-*.otel-* METADATA _id
| WHERE data_stream.type == &quot;logs&quot;
    AND @timestamp &gt; NOW() - 1h
| EVAL no_sev = severity_number IS NULL OR severity_number == 0
| STATS 
    logs_wo_severity = COUNT(*) WHERE no_sev,
    example = SAMPLE(_id, 1) WHERE no_sev,
    total = COUNT(*)
      BY service.name
| EVAL rule_passed = (logs_wo_severity == 0),
    extent = CASE(total != 0, logs_wo_severity / total, 0.0)
| KEEP rule_passed, service.name, example, extent
</code></pre>
<p>These rule evaluation queries are wrapped in a Kibana <code>instrumentation-score</code> plugin that utilizes the task manager for regular execution.
The <code>instrumentation-score</code> plugin then takes the results from all the evaluation queries for the different rules and calculates the final instrumentation score value (overall and broken down by service) following the <a href="https://github.com/instrumentation-score/spec/blob/main/specification.md#score-calculation-formula">Instrumentation Score spec's calculation formula</a>.
The resulting instrumentation score values, as well as the rule evaluation results (with the examples and extent), are then stored in separate Elasticsearch indices for consumption.</p>
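<p>The spec's calculation formula weights each rule by its impact level. The following is a minimal sketch of that idea in Java; the impact weights here are placeholder assumptions for illustration, not the normative values defined in the specification:</p>

```java
import java.util.List;
import java.util.Map;

public class ScoreSketch {
    // Assumed, illustrative impact weights -- see the spec for the normative values.
    static final Map<String, Integer> WEIGHTS =
        Map.of("NORMAL", 10, "IMPORTANT", 25, "CRITICAL", 50);

    record RuleResult(String impact, boolean passed) {}

    // score = (sum of weights of passed rules / sum of weights of applicable rules) * 100
    static double score(List<RuleResult> results) {
        int total = 0;
        int passed = 0;
        for (RuleResult r : results) {
            int w = WEIGHTS.getOrDefault(r.impact(), 0);
            total += w;
            if (r.passed()) passed += w;
        }
        return total == 0 ? 100.0 : 100.0 * passed / total;
    }

    public static void main(String[] args) {
        double s = score(List.of(
            new RuleResult("CRITICAL", true),    // passed
            new RuleResult("IMPORTANT", false),  // failed
            new RuleResult("NORMAL", true)));    // passed
        System.out.printf("score = %.1f%n", s); // (50 + 10) / 85 * 100
    }
}
```

<p>With this shape, a single failed high-impact rule pulls the score down much more than a failed low-impact rule, which is exactly the behavior the impact levels are meant to encode.</p>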
<p>With the results stored in dedicated Elasticsearch indices, we can build Dashboards to visualize the Instrumentation Score insights and allow users to troubleshoot their data quality issues.</p>
<p>In this POC, I implemented a subset of the Instrumentation Score rules to prove out the approach.</p>
<p>The Instrumentation Score concept accommodates extension with your own custom rules.
I did that in my POC as well to test some quality rules that are not yet formalized as rules in the Instrumentation Score spec,
but are important for Elastic Observability to provide the maximum value from the OTel data.</p>
<h2>Applying the Instrumentation Score on the OpenTelemetry Demo</h2>
<p>The <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry Demo</a> is the most-used environment to play around with and showcase OpenTelemetry capabilities.
Initially, I thought the demo would be the worst environment to test my Instrumentation Score implementation.
After all, it's the showcase environment for OpenTelemetry, and I expected it to have an Instrumentation Score close to 100.
Surprisingly, that wasn't the case.</p>
<p>Let's start with the overview.</p>
<h3>The Overview</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/otel-instrumentation-score/dashboard-overview.png" alt="Dashboard Overview" /></p>
<p>This dashboard shows an overview of the Instrumentation Score results for the OpenTelemetry Demo environment.
The first thing you might notice is the very low overall score <code>35</code> (top-left corner).
The table in the bottom-left corner shows a breakdown of the score by services.
Somewhat surprisingly, all the service scores are higher than the overall score.
How is that possible?</p>
<p>The main reason is that Instrumentation Score rules have, by definition, a binary result — passed or not.
It can therefore happen that each service fails a single, but distinct, rule: each service score then reflects only one failed rule and is not perfect, but also not too bad.
From the overall perspective, however, many rules have failed (each by a different service), leading to a very low overall score.
For example, if each of ten services fails exactly one distinct rule, every per-service score counts a single failure, while the overall evaluation counts ten failed rules.</p>
<p>In the table on the right, we see the results for the individual rules with their description, impact level, and example occurrences.
We see that 7 out of 11 implemented rules have failed. Let's pick our favorite example from earlier — <code>SPA-002</code> (in row 5), the orphan spans rule.</p>
<p>With the dashboard indicating that the rule <code>SPA-002</code> has failed, we know that there are orphan spans somewhere in our OTel traces. But where exactly?</p>
<p>For further analysis, we have two ways to drill down: (1) into a specific rule to see which services violate a specific rule, or (2) into a specific service to see which rules are violated by that service.</p>
<h3>Rule Drilldown</h3>
<p>The following dashboard shows a detailed view into the rule evaluation results for individual rules.
In this case we selected rule <code>SPA-002</code> at the top.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/otel-instrumentation-score/dashboard-rule-spa-002.png" alt="Dashboard Overview" /></p>
<p>In addition to the rule's meta information, such as its description, rationale, and criteria, we see some statistics on the right.
For example, we see that 2 services have failed that rule, 16 passed, and for 19 services the rule is not applicable (e.g., because they don't have tracing data).
In the table below, we see which two services are impacted by this rule violation: the <code>frontend</code> and <code>frontend-proxy</code> services.
For each service, we also see the <em>extent</em>. In the case of the <code>frontend</code> service, around 20% of traces have orphan spans.
This information is crucial as it gives an indication of how severe the rule violation actually is.
If it had been under 1%, this problem might have been negligible, but with one trace out of five being broken, it definitely needs to be fixed.
Also, for each of the services, we have an example <code>span.id</code> for which no span could be found, but which is referenced as the <code>parent.id</code> by other spans.
This allows us to perform further analyses (e.g., by investigating the referring spans in Kibana's Discover) on concrete example cases.</p>
<p>With that view, we now know that the <code>frontend</code> service has a good amount of broken traces.
But is that service also violating other rules? And, if yes, which?</p>
<h3>Service Drilldown</h3>
<p>To answer the above question we can switch to the <code>Per Service</code> Dashboard.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/otel-instrumentation-score/dashboard-service-frontend.png" alt="Dashboard Overview" /></p>
<p>In this dashboard, we see similar information as on the overview dashboard, however, filtered on a single selected service (e.g., <code>frontend</code> service in this example).
In the table, we see that the <code>frontend</code> service violates three rules. We already know about <code>SPA-002</code> from the previous section.
In addition, the violation of the custom rule <code>SPA-C-001</code> shows that around 99% of transaction span names have high cardinality.
In Elastic Observability, <code>transactions</code> refer to service-local root spans (i.e., entry points into services).
In the example value, we see directly why the <code>span.name</code>s (here referred to as <code>transaction.name</code>s) have high cardinality:
the instrumentation constructs the span name from the request URL, which contains unique identifiers (here, the session ID).
As the <a href="https://www.elastic.co/docs/reference/edot-collector">EDOT Collector</a> derives metrics for transaction-type spans, we can also observe a violation of <code>MET-001</code>, which requires bounded cardinality on metric dimensions.</p>
<p>As you can see, with the Instrumentation Score concept and a few different breakdown views, we were able to pinpoint data quality issues and identify which services and instrumentations need improvement to fix the issues.</p>
<h2>Learnings and Observations</h2>
<p>My experimentation with the Instrumentation Score was very insightful and showed me the power of this concept — though it's still in its early phase.
It is particularly insightful if the implementation and calculation include breakdowns by meaningful entities, such as services, K8s pods, hosts, etc.
With such a breakdown, you can narrow down data quality issues to a manageable scope, instead of having to sift through huge amounts of data and entities.</p>
<p>Furthermore, I realized that having some notion of problem extent (per rule and service), as well as concrete examples, helps make the problem more tangible.</p>
<p>Thinking further about the idea of rule violation <code>extent</code>, there might even be a way to incorporate that into the score formula itself.
In my humble opinion, this would make the score significantly more comparable and indicative of the actual impact.
I <a href="https://github.com/instrumentation-score/spec/issues/43">proposed this idea in an issue</a> on the Instrumentation Score project.</p>
<h2>Conclusion</h2>
<p>The Instrumentation Score is a powerful approach to ensuring a high level of data quality with OpenTelemetry.</p>
<p>Thank you to the maintainers — Antoine Toulme, Daniel Gomez Blanco, Juraci Paixão Kröhling, and Michele Mancioppi — for bringing this great project to life, and to all the contributors for their participation!</p>
<p>With proper implementation of the rules and score calculation, users can easily get actionable insights into what they need to fix in their instrumentation and data collection.
The Instrumentation Score rules are at an early stage and are being steadily improved and extended.
I'm looking forward to what the community will build in the scope of this project in the future, and I hope to intensify my contributions as well.</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/otel-instrumentation-score/otel-instrumentation-grade.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Revealing unknowns in your tracing data with inferred spans in OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/tracing-data-inferred-spans-opentelemetry</link>
            <guid isPermaLink="false">tracing-data-inferred-spans-opentelemetry</guid>
            <pubDate>Mon, 22 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Distributed tracing is essential in understanding complex systems, but it can miss latency issue details. By combining profiling techniques with distributed tracing, Elastic provides the inferred spans feature as an extension for the OTel Java SDK.]]></description>
            <content:encoded><![CDATA[<p>In the complex world of microservices and distributed systems, achieving transparency and understanding the intricacies and inefficiencies of service interactions and request flows has become a paramount challenge. Distributed tracing is essential in understanding distributed systems. But distributed tracing, whether manually applied or auto-instrumented, is usually rather coarse-grained. Hence, distributed tracing covers only a limited fraction of the system and can easily miss parts of the system that are the most useful to trace.</p>
<p>Addressing this gap, Elastic developed the concept of inferred spans as a powerful enhancement to traditional instrumentation-based tracing, implemented as an extension for the OpenTelemetry Java SDK/Agent. We are in the process of contributing this back to OpenTelemetry; until then, our <a href="https://github.com/elastic/elastic-otel-java/tree/main/inferred-spans">extension</a> can be seamlessly used with the existing OpenTelemetry Java SDK (as described below).</p>
<p>Inferred spans are designed to augment the visibility provided by instrumentation-based traces, shedding light on latency sources within the application or libraries that were previously uninstrumented. This feature significantly expands the utility of distributed tracing, allowing for a more comprehensive understanding of system behavior and facilitating a deeper dive into performance optimization.</p>
<h2>What is inferred spans?</h2>
<p>Inferred spans is an observability technique that combines distributed tracing with profiling techniques to illuminate the darker, unobserved corners of your application — areas where standard instrumentation techniques fall short. The inferred spans feature interweaves information derived from profiling stacktraces with instrumentation-based tracing data, allowing for the generation of new spans based on the insights drawn from profiling data.</p>
<p>This feature proves invaluable when dealing with custom code or third-party libraries that significantly contribute to the request latency but lack built-in or external instrumentation support. Often, identifying or crafting specific instrumentation for these segments can range from challenging to outright unfeasible. Moreover, certain scenarios exist where implementing instrumentation is impractical due to the potential for substantial performance overhead. For instance, instrumenting application locking mechanisms, despite their critical role, is not viable because of their ubiquitous nature and the significant latency overhead the instrumentation can introduce to application requests. Still, ideally, such latency issues would be visible within your distributed traces.</p>
<p>Inferred spans ensures a deeper visibility into your application’s performance dynamics including the above-mentioned scenarios.</p>
<h2>Inferred spans in action</h2>
<p>To demonstrate the inferred spans feature we will use the Java implementation of the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/java-favorite">Elastiflix demo application</a>. Elastiflix has an endpoint called favorites that performs some Redis calls and also includes an artificial delay. First, we use the plain OpenTelemetry Java Agent to instrument our application:</p>
<pre><code class="language-shell">java -javaagent:/path/to/otel-javaagent-&lt;version&gt;.jar \
-Dotel.service.name=my-service-name \
-Dotel.exporter.otlp.endpoint=https://&lt;our-elastic-apm-endpoint&gt; \
&quot;-Dotel.exporter.otlp.headers=Authorization=Bearer SECRETTOKENHERE&quot; \
-jar my-service-name.jar
</code></pre>
<p>With the OpenTelemetry Java Agent we get out-of-the-box instrumentation for HTTP entry points and calls to Redis for our Elastiflix application. The resulting traces contain spans for the POST /favorites entrypoint, as well as a few short spans for the calls to Redis.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/tracing-data-inferred-spans-opentelemetry/image2.png" alt="POST /favorites entrypoint" /></p>
<p>As you can see in the trace above, it’s not clear where most of the time is spent within the POST /favorites request.</p>
<p>Let’s see how inferred spans can shed light into these areas. You can use the inferred spans feature manually with your OpenTelemetry SDK (see the section below), package it as a drop-in extension for the upstream OpenTelemetry Java agent, or just use <a href="https://github.com/elastic/elastic-otel-java/tree/main">Elastic’s distribution of the OpenTelemetry Java agent</a> that comes with the inferred spans feature.</p>
<p>For convenience, we just download the <a href="https://mvnrepository.com/artifact/co.elastic.otel/elastic-otel-javaagent/0.0.1">agent jar</a> of the Elastic distribution and extend the configuration to enable the inferred spans feature:</p>
<pre><code class="language-shell">java -javaagent:/path/to/elastic-otel-javaagent-&lt;version&gt;.jar \
-Dotel.service.name=my-service-name \
-Dotel.exporter.otlp.endpoint=https://XX.apm.europe-west3.gcp.cloud.es.io:443 \
&quot;-Dotel.exporter.otlp.headers=Authorization=Bearer SECRETTOKENHERE&quot; \
-Delastic.otel.inferred.spans.enabled=true \
-jar my-service-name.jar
</code></pre>
<p>The only non-standard option here is <code>elastic.otel.inferred.spans.enabled</code>: the inferred spans feature is currently opt-in and therefore needs to be enabled explicitly. Running the same application with the inferred spans feature enabled yields more comprehensive traces:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/tracing-data-inferred-spans-opentelemetry/image1.png" alt="more comprehensive traces" /></p>
<p>The inferred-spans (colored blue in the above screenshot) follow the naming pattern Class#method. With that, the inferred spans feature helps us pinpoint the exact methods that contribute the most to the overall latency of the request. Note that the parent-child relationship between the HTTP entry span, the Redis spans, and the inferred spans is reconstructed correctly, resulting in a fully functional trace structure.</p>
<p>Examining the handleDelay method within the Elastiflix application reveals the use of a straightforward sleep statement. Although the sleep method is not CPU-bound, the full duration of this delay is captured as inferred spans. This stems from employing the async-profiler's wall clock time profiling, as opposed to solely relying on CPU profiling. The ability of the inferred spans feature to reflect actual latency, including for I/O operations and other non-CPU-bound tasks, represents a significant advancement. It allows for diagnosing and resolving performance issues that extend beyond CPU limitations, offering a more nuanced view of system behavior.</p>
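<p>The shape of such a method can be sketched as follows. This is a hypothetical, simplified stand-in (the actual Elastiflix source may differ); the point is that the sleep consumes wall-clock time without any CPU work or instrumentation, so only wall-clock profiling can attribute its full duration:</p>

```java
// Hypothetical sketch of an uninstrumented delay, similar in spirit to the
// handleDelay method discussed above. Thread.sleep creates no span and burns
// no CPU, so it is invisible to plain tracing and to pure CPU profiling alike.
public class DelayExample {
    static long handleDelay(long millis) {
        long start = System.nanoTime();
        try {
            Thread.sleep(millis); // no instrumentation here -- no span is created
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return (System.nanoTime() - start) / 1_000_000; // elapsed wall-clock ms
    }

    public static void main(String[] args) {
        System.out.println("slept ~" + handleDelay(50) + " ms");
    }
}
```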
<h2>Using inferred spans with your own OpenTelemetry SDK</h2>
<p>OpenTelemetry is a highly extensible framework: Elastic embraces this extensibility by also publishing most extensions shipped with our OpenTelemetry Java Distro as standalone extensions to the <a href="https://github.com/open-telemetry/opentelemetry-java">OpenTelemetry Java SDK</a>.</p>
<p>As a result, if you do not want to use our distro (e.g., because you don’t need or want bytecode instrumentation in your project), you can still use our extensions, such as the extension for the inferred spans feature. All you need to do is set up the <a href="https://opentelemetry.io/docs/languages/java/instrumentation/#initialize-the-sdk">OpenTelemetry SDK in your code</a> and add the inferred spans extension as a dependency:</p>
<pre><code class="language-xml">&lt;dependency&gt;
    &lt;groupId&gt;co.elastic.otel&lt;/groupId&gt;
    &lt;artifactId&gt;inferred-spans&lt;/artifactId&gt;
    &lt;version&gt;{latest version}&lt;/version&gt;
&lt;/dependency&gt;
</code></pre>
<p>During your SDK setup, you’ll have to initialize and register the extension:</p>
<pre><code class="language-java">InferredSpansProcessor inferredSpans = InferredSpansProcessor.builder()
  .samplingInterval(Duration.ofMillis(10)) //the builder offers all config options
  .build();
SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
  .addSpanProcessor(inferredSpans)
  .addSpanProcessor(BatchSpanProcessor.builder(OtlpGrpcSpanExporter.builder()
    .setEndpoint(&quot;https://&lt;your-elastic-apm-endpoint&gt;&quot;)
    .addHeader(&quot;Authorization&quot;, &quot;Bearer &lt;secrettoken&gt;&quot;)
    .build()).build())
  .build();
inferredSpans.setTracerProvider(tracerProvider);
</code></pre>
<p>The inferred spans extension seamlessly integrates with the <a href="https://opentelemetry.io/docs/languages/java/instrumentation/#automatic-configuration">OpenTelemetry SDK Autoconfiguration mechanism</a>. By incorporating the OpenTelemetry SDK and its extensions as dependencies within your application code — rather than through an external agent — you gain the flexibility to configure them using the same environment variables or JVM properties. Once the inferred spans extension is included in your classpath, activating it for autoconfigured SDKs becomes straightforward. Simply enable it using the elastic.otel.inferred.spans.enabled property, as previously described, to leverage the full capabilities of this feature with minimal setup.</p>
<h2>How does inferred spans work?</h2>
<p>The inferred spans feature leverages the wall-clock profiling capabilities of the widely used <a href="https://github.com/async-profiler/async-profiler">async-profiler</a>, a popular, low-overhead, production-time profiler in the Java ecosystem. It then transforms the profiling data into actionable spans as part of the distributed traces. But what mechanism allows for this transformation?</p>
<p>Essentially, the inferred spans extension engages with the lifecycle of span events, specifically when a span is either activated or deactivated across any thread via the <a href="https://opentelemetry.io/docs/specs/otel/context/">OpenTelemetry context</a>. Upon the activation of the initial span within a transaction, the extension commences a session of wall-clock profiling via the async-profiler, set to a predetermined duration. Concurrently, it logs the details of all span activations and deactivations, capturing their respective timestamps and the threads on which they occurred.</p>
<p>Following the completion of the profiling session, the extension processes the profiling data alongside the log of span events. By correlating the data, it reconstructs the inferred spans. It's important to note that, in certain complex scenarios, the correlation may assign an incorrect name to a span. To mitigate this and aid in accurate identification, the extension enriches the inferred spans with stacktrace segments under the code.stacktrace attribute, offering users clarity and insight into the precise methods implicated.</p>
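<p>The core correlation idea can be illustrated with a greatly simplified sketch: given wall-clock stack samples taken while a span was active on a thread, a run of consecutive samples containing the same method can be turned into an inferred child span. The real extension's algorithm and data structures are considerably more sophisticated; all names below are illustrative:</p>

```java
import java.util.List;

// Simplified illustration of span inference from wall-clock samples.
public class InferenceSketch {
    record Sample(long timestampMs, List<String> frames) {}
    record InferredSpan(String name, long startMs, long endMs) {}

    // Infer a span for `method` covering the first through last sample whose
    // stack contains it; returns null if the method was never sampled.
    static InferredSpan infer(String method, List<Sample> samples) {
        long start = -1, end = -1;
        for (Sample s : samples) {
            if (s.frames().contains(method)) {
                if (start < 0) start = s.timestampMs();
                end = s.timestampMs();
            }
        }
        return start < 0 ? null : new InferredSpan(method, start, end);
    }

    public static void main(String[] args) {
        // Four samples taken while the entry span was active on this thread.
        List<Sample> samples = List.of(
            new Sample(0,  List.of("Controller#favorites")),
            new Sample(10, List.of("Controller#favorites", "Service#handleDelay")),
            new Sample(20, List.of("Controller#favorites", "Service#handleDelay")),
            new Sample(30, List.of("Controller#favorites")));
        // Infers a Service#handleDelay span spanning 10 ms to 20 ms.
        System.out.println(infer("Service#handleDelay", samples));
    }
}
```

<p>This also hints at why very short methods can be missed: a method that never appears in a sample yields no inferred span, which is the sampling-interval trade-off discussed in the overhead section below.</p>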
<h2>Inferred spans vs. correlation of traces with profiling data</h2>
<p>In the wake of OpenTelemetry's recent <a href="https://opentelemetry.io/blog/2024/profiling/">announcement of the profiling signal</a>, coupled with <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">Elastic's commitment to donating the Universal Profiling Agent</a> to OpenTelemetry, you might be wondering about how the inferred spans feature differentiates from merely correlating profiling data with distributed traces using span IDs and trace IDs. Rather than viewing these as competing functionalities, it's more accurate to consider them complementary.</p>
<p>The inferred spans feature and the correlation of tracing with profiling data both employ similar methodologies — melding tracing information with profiling data. However, they each shine in distinct areas. Inferred spans excels at identifying long-running methods that could escape notice with traditional CPU profiling, which is more adept at pinpointing CPU bottlenecks. A unique advantage of inferred spans is its ability to account for I/O time, capturing delays caused by operations like disk access that wouldn't typically be visible in CPU profiling flamegraphs.</p>
<p>However, the inferred spans feature has its limitations, notably in detecting latency issues arising from &quot;death by a thousand cuts&quot; — where a method, although not time-consuming per invocation, significantly impacts total latency due to being called numerous times across a request. While individual calls might not be captured as inferred spans due to their brevity, CPU-bound methods contributing to latency are unveiled through CPU profiling, as flamegraphs display the aggregate CPU time consumed by these methods.</p>
<p>An additional strength of the inferred spans feature lies in its data structure, offering a simplified tracing model that outlines typical parent-child relationships, execution order, and good latency estimates. This structure is achieved by integrating tracing data with span activation/deactivation events and profiling data, facilitating straightforward navigation and troubleshooting of latency issues within individual traces.</p>
<p>Correlating distributed tracing data with profiling data comes with a different set of advantages. Learn more about it in our related blog post, <a href="https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation">Beyond the trace: Pinpointing performance culprits with continuous profiling and distributed tracing correlation</a>.</p>
<h2>What about the performance overhead?</h2>
<p>As mentioned before, the inferred spans functionality is based on the widely used async-profiler, known for its minimal impact on performance. However, the efficiency of profiling operations is not without its caveats, largely influenced by the specific configurations employed. A pivotal factor in this balancing act is the sampling interval — the longer the interval between samples, the lower the incurred overhead, albeit at the expense of potentially overlooking shorter methods that could be critical to the inferred spans feature discovery process.</p>
<p>Adjusting the probability-based trace sampling presents another way for optimization, directly influencing the overhead. For instance, setting trace sampling to 50% effectively halves the profiling load, making the inferred spans feature even more resource-efficient on average per request. This nuanced approach to tuning ensures that the inferred spans feature can be leveraged in real-world, production environments with a manageable performance footprint. When properly configured, this feature offers a potent, low-overhead solution for enhancing observability and diagnostic capabilities within production applications.</p>
<h2>What’s next for inferred spans and OpenTelemetry?</h2>
<p>This blog post outlined and introduced the inferred spans feature available as an extension for the OpenTelemetry Java SDK and built into the newly introduced Elastic OpenTelemetry Java Distro. Inferred spans allows users to troubleshoot latency issues in areas of code that are not explicitly instrumented while utilizing traditional tracing data.</p>
<p>The feature is currently merely a port of the existing feature from the proprietary Elastic APM Agent. With Elastic embracing OpenTelemetry, we plan on contributing this extension to the upstream OpenTelemetry project. For that, we also plan on migrating the extension to the latest async-profiler 3.x release. <a href="https://github.com/elastic/elastic-otel-java/tree/main/inferred-spans">Try out inferred spans for yourself</a> and see how it can help you diagnose performance problems in your applications.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/tracing-data-inferred-spans-opentelemetry/148360-Blog-header-image--Revealing-Unknowns-in-your-Tracing-Data-with-Inferred-Spans-in-OpenTelemetry_V1.jpg" length="0" type="image/jpeg"/>
        </item>
    </channel>
</rss>