Best practices for instrumenting OpenTelemetry

OpenTelemetry (OTel) is steadily gaining broad industry adoption. As one of the major Cloud Native Computing Foundation (CNCF) projects, with as many commits as Kubernetes, it is backed by major ISVs and cloud providers. Many global companies in finance, insurance, tech, and other industries are starting to standardize on OpenTelemetry. With OpenTelemetry, DevOps teams have a consistent approach to collecting and ingesting telemetry data, which provides a de facto standard for observability. Teams can rely on vendor-agnostic, future-proof instrumentation of their applications and switch observability backends without the overhead of adapting that instrumentation.

Teams that have chosen OpenTelemetry for instrumentation face a choice between different instrumentation techniques and data collection approaches. Determining how to instrument and what mechanism to use can be challenging. In this blog, we will go over Elastic’s recommendations around some best practices for OpenTelemetry instrumentation:

Automatic or manual? We’ll cover the need for one versus the other and provide recommendations based on your situation.
Collector or direct from the application? While the traditional option is to use a collector, observability tools like Elastic Observability can take telemetry from OpenTelemetry applications directly.
What to instrument from OTel SDKs. Traces and metrics are well supported (per the status table in the OTel docs), but logs are still in progress. Elastic® is helping that progress with its contribution of ECS to OTel. Regardless of the status in OTel, you need to test and ensure these instrumentations work for you.
Advantages and disadvantages of OpenTelemetry

OTel automatic or manual instrumentation: Which one should I use?

While there are two ways to instrument your applications with OpenTelemetry — automatic and manual — there isn’t a perfect answer, as it depends on your needs. There are pros and cons of using one versus another, such as:

Auto-magic experience vs. control over instrumentation
Customization vs. out-of-the-box data
Instrumentation overhead
Simplicity vs. flexibility

Additionally, you might even land on a combination depending on availability and need.

Let’s review both automatic and manual instrumentation and explore specific recommendations.

Auto-instrumentation

For most of the programming languages and runtimes, OpenTelemetry provides an auto-instrumentation approach for gathering telemetry data. Auto-instrumentation provides a set of pre-defined, out-of-the-box instrumentation modules for well-known frameworks and libraries. With that, users can gather telemetry data (such as traces, metrics, and logs) from well-known frameworks and libraries used by their application with only minimal or even no need for code changes.

Here are some of the apparent benefits of using auto-instrumentation:

Quicker development and path to production. Auto-instrumentation saves time by accelerating the process of integrating telemetry into an application, allowing more focus on other critical tasks.
Simpler maintenance by only having to update one line, which is usually the container start command where auto-instrumentation is configured, versus having to update multiple lines of code across multiple classes, methods, and services.
Easier to keep up with the latest features and improvements in the OpenTelemetry project without manually updating the instrumentation of used libraries and/or code.

There are also some disadvantages and limitations of the auto-instrumentation approach:

Auto-instrumentation collects telemetry data only for the frameworks and libraries in use for which an explicit auto-instrumentation module exists. In particular, it’s unlikely that auto-instrumentation would collect telemetry data for “exotic” or custom libraries.
Auto-instrumentation does not capture telemetry for pure custom code (that does not use well-known libraries underneath).
Auto-instrumentation modules come with a pre-defined, opinionated instrumentation logic that provides sufficient and meaningful information in the vast majority of cases. However, in some custom edge cases, the information value, structure, or level of detail of the data provided by auto-instrumentation modules might not be sufficient.
Depending on the runtime, technology, and size of the target application, auto-instrumentation may come with a (slightly) higher start-up or runtime overhead compared to manual instrumentation. In the majority of cases, this overhead is negligible but may become a problem in some edge cases.

Here is an example of auto-instrumenting a Python application with OpenTelemetry. If you had a Python application locally, you would run the command below to auto-instrument it, replacing each uppercase placeholder with your own value:

opentelemetry-instrument \
    --traces_exporter OTEL_TRACES_EXPORTER \
    --metrics_exporter OTEL_METRICS_EXPORTER \
    --service_name OTEL_SERVICE_NAME \
    --exporter_otlp_endpoint OTEL_EXPORTER_OTLP_ENDPOINT \
    python main.py

Learn more about auto-instrumentation with OpenTelemetry for Python applications.
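The flags in the command above correspond to OpenTelemetry environment variables with the same name and an OTEL_ prefix, so the same settings can be supplied either way. As a minimal sketch, the command line could be assembled from an environment mapping like this. The helper name build_instrument_command and the fallback values in DEFAULTS are hypothetical and not part of OpenTelemetry.

```python
import shlex

# Hypothetical fallbacks used when a variable is unset; adjust to your setup.
DEFAULTS = {
    "OTEL_TRACES_EXPORTER": "otlp",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_SERVICE_NAME": "my-python-service",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
}


def build_instrument_command(env, app="main.py"):
    """Return the opentelemetry-instrument argv for the given environment mapping."""
    argv = ["opentelemetry-instrument"]
    for variable, fallback in DEFAULTS.items():
        # OTEL_TRACES_EXPORTER -> --traces_exporter, and so on.
        flag = "--" + variable[len("OTEL_"):].lower()
        argv += [flag, env.get(variable, fallback)]
    return argv + ["python", app]


command = build_instrument_command({"OTEL_SERVICE_NAME": "favorite"})
print(shlex.join(command))
```

In a real setup you would likely pass os.environ instead of the literal dictionary, or skip the flags entirely and export the environment variables before running opentelemetry-instrument.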

Finally, developers familiar with OpenTelemetry's APIs can leverage their existing knowledge by using auto-instrumentation, avoiding the complexities that may arise from manual instrumentation. However, manual instrumentation might still be preferred for specific use cases or when custom requirements cannot be fully addressed by auto-instrumentation.

Combination: Automatic and Manual

Before we move on to manual instrumentation, note that you can also combine automatic and manual instrumentation. Once auto-instrumentation has helped you understand the application’s behavior, you can determine whether you need additional instrumentation for code that auto-instrumentation does not trace.

Additionally, because auto-instrumentation is not equally mature across the OTel language set, you will probably need to instrument manually in some cases. For example, auto-instrumentation of a Flask-based Python application might not automatically show middleware calls, such as calls made through the requests library. In this situation, you will have to add manual instrumentation to the Python application if you also want to see middleware tracing. However, as these libraries mature, more support options will become available.

A combination is where most developers will ultimately land when the application gets to near production quality.

Manual instrumentation

If the auto-instrumentation does not cover your needs, you want to have more control over the instrumentation, or you’d like to treat instrumentation as code, using manual instrumentation is likely the right choice for you. As described above, you can use it as an enhancement to auto-instrumentation or entirely switch to manual instrumentation. If you eventually go down a path of manual instrumentation, it definitely provides more flexibility but also means you will have to not only code in the traces and metrics but also maintain it regularly.

As new features are added and changes to the libraries are made, the maintenance for the code may or may not be cumbersome. It’s a decision that requires some forethought.

Here are some reasons why you would potentially use manual instrumentation:

You may already have some OTel instrumented applications using auto-instrumentation and need to add more telemetry for specific functions or libraries (like DBs or middleware), thus you will have to add manual instrumentation.
You need more flexibility and control in terms of the application language and what you’d like to instrument.
In case there's no auto-instrumentation available for your programming language and the technologies in use, manual instrumentation would be the way to go for your applications built using these languages.
You might have to instrument for logging with an alternative approach, as logging is not yet stable for all the programming languages.
You need to customize and enrich your telemetry data for your specific use cases — for example, you have a multi-tenant application and you need to get each tenant’s information and then use manual instrumentation via the OpenTelemetry SDK.

Recommendations for manual instrumentation
Manual instrumentation requires specific configuration to ensure you have the best experience with OTel. Below are Elastic’s recommendations (as outlined by the CNCF) for gaining the most benefit when instrumenting manually:

Ensure that your provider configuration and tracer initialization is done properly.
Ensure you set up spans in all the functions you want traced.
Set up resource attributes correctly.
Use batch rather than simple processing.

Let’s review these individually:

1. Ensure that your provider configuration and tracer initialization is done properly.
The general rule of thumb is to configure all your variables and initialize the tracer at the start of the application. Using the Elastiflix application’s Python favorite service as an example, we can see:

Tracer being set up globally

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource

...


resource = Resource.create(resource_attributes)

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(exporter)
provider.add_span_processor(processor)

# Sets the global default tracer provider
trace.set_tracer_provider(provider)

# Creates a tracer from the global tracer provider
tracer = trace.get_tracer(otel_service_name)

In the above, we’ve imported the OpenTelemetry trace module and the TracerProvider, which is the entry point of the API. It provides access to the Tracer, which is the class responsible for creating spans.

Additionally, we specify the use of BatchSpanProcessor. The span processor is an interface that provides hooks for span start and end method invocations.

In OpenTelemetry, different span processors are offered. The BatchSpanProcessor batches spans and sends them in bulk. Multiple span processors can be configured to be active at the same time (some SDKs implement this with a MultiSpanProcessor); in the Python SDK, each call to add_span_processor registers another processor. See OpenTelemetry Documentation.

The variable otel_service_name is read from an environment variable, as are the OTLP endpoint and the other settings, which are also set up globally. See below:

import os

otel_service_name = os.environ.get('OTEL_SERVICE_NAME') or 'favorite_otel_manual'
environment = os.environ.get('ENVIRONMENT') or 'dev'
otel_service_version = os.environ.get('OTEL_SERVICE_VERSION') or '1.0.0'

otel_exporter_otlp_headers = os.environ.get('OTEL_EXPORTER_OTLP_HEADERS')
otel_exporter_otlp_endpoint = os.environ.get('OTEL_EXPORTER_OTLP_ENDPOINT')

In the above code, we initialize several variables. Because we also imported Resource, some of them are used to set resource attributes:

Resource variables (we will cover this later in this article):

otel_service_name – This helps set the name of the service (service.name) in OTel Resource attributes.
otel_service_version – This helps set the version of the service (service.version) in OTel Resource attributes.
environment – This helps set the deployment.environment attribute in OTel Resource attributes.

Exporter variables:

otel_exporter_otlp_endpoint – This helps set the OTLP endpoint where traces, logs, and metrics are sent. Elastic would be an OTLP endpoint. You can also use OTEL_EXPORTER_OTLP_TRACES_ENDPOINT or OTEL_EXPORTER_OTLP_METRICS_ENDPOINT if you want to send traces or metrics to signal-specific endpoints.
otel_exporter_otlp_headers – This carries the headers, such as the authorization header, needed for the endpoint.
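OTEL_EXPORTER_OTLP_HEADERS holds comma-separated key=value pairs, so a manually configured exporter usually needs them as a dictionary. The following is a minimal sketch of that conversion using only the standard library. The function name parse_otlp_headers and the token value are hypothetical, and lowercasing the header names is an assumption of this sketch (gRPC metadata keys are lowercase).

```python
from urllib.parse import unquote


def parse_otlp_headers(raw):
    """Parse 'key1=value1,key2=value2' into a dict of header names to values."""
    headers = {}
    for item in (raw or "").split(","):
        # Split on the first '=' only, so values may themselves contain '='.
        name, separator, value = item.partition("=")
        if not separator or not name.strip():
            continue  # skip empty or malformed entries
        # Values may be percent-encoded, e.g. 'Bearer%20abc' for 'Bearer abc'.
        headers[name.strip().lower()] = unquote(value.strip())
    return headers


headers = parse_otlp_headers("Authorization=Bearer%20my-secret-token")
print(headers)
```

In the favorite service, the input would be the otel_exporter_otlp_headers variable read from the environment rather than a literal string.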

The separation of your provider and tracer configuration allows you to use any OpenTelemetry provider and tracing framework that you choose.

2. Set up your spans inside the application functions themselves.
Make sure your spans end and are in the right context so you can track the relationships between spans. In our Python favorite application, the function that retrieves a user’s favorite movies shows:

@app.route('/favorites', methods=['GET'])
def get_favorite_movies():
    # add artificial delay if enabled
    if delay_time > 0:
        time.sleep(max(0, random.gauss(delay_time/1000, delay_time/1000/10)))

    with tracer.start_as_current_span("get_favorite_movies") as span:
        user_id = str(request.args.get('user_id'))

        logger.info('Getting favorites for user ' + user_id, extra={
            "event.dataset": "favorite.log",
            "user.id": request.args.get('user_id')
        })

        favorites = r.smembers(user_id)

        # convert to list
        favorites = list(favorites)
        logger.info('User ' + user_id + ' has favorites: ' + str(favorites), extra={
            "event.dataset": "favorite.log",
            "user.id": user_id
        })
        return { "favorites": favorites}

While you can instrument every function, it’s strongly recommended that you instrument only what you need, to avoid a flood of data. What you need depends not only on the development process but also on what SREs, and potentially the business, need to observe in the application. Instrument for your target use cases.

Also, avoid instrumenting trivial utility functions or functions that are called very frequently (e.g., getter/setter functions). Otherwise, you would produce a huge amount of telemetry data with very little additional value.

3. Set resource attributes and use semantic conventions

Resource attributes
Attributes such as service.name, service.version, deployment.environment, and cloud are important in managing version, environment, cloud provider, etc. for the specific service. Resource attributes describe resources such as hosts, systems, processes, and services and do not change during the lifetime of the resource. Resource attributes are a great help for correlating data, providing additional context to telemetry data and, thus, helping narrow down root causes of problems during troubleshooting. While they are simple to set up with auto-instrumentation, you need to ensure you also send them when you instrument your application manually.

Check out OpenTelemetry’s list of attributes that can be set in the OTel documentation.

In our auto-instrumented Python application from above, here is how we set up resource attributes:

opentelemetry-instrument \
    --traces_exporter console,otlp \
    --metrics_exporter console \
    --service_name your-service-name \
    --exporter_otlp_endpoint 0.0.0.0:4317 \
    python myapp.py

However, when instrumenting manually, you need to add your resource attributes and ensure you have consistent values across your application’s code. Resource attributes have been defined by OpenTelemetry’s Resource Semantic Convention and can be found here. In fact, your organization should have a resource attribute convention that is applied across all applications.

These attributes are added to your metrics, traces, and logs, helping you filter out data, correlate, and make more sense out of them.

Here is an example of setting resource attributes in our Python service:

resource_attributes = {
    "service.name": otel_service_name,
    "service.version": otel_service_version,
    "deployment.environment": environment
}

resource = Resource.create(resource_attributes)

provider = TracerProvider(resource=resource)

We’ve set up service.name, service.version, and deployment.environment. You can set up as many resource attributes as you need, but you need to ensure you pass the resource attributes into the tracer with provider = TracerProvider(resource=resource).
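One way to keep resource attribute values consistent between manually and auto-instrumented services is to define them once and render them in the key=value,key=value form that the OTEL_RESOURCE_ATTRIBUTES environment variable accepts. The following is a minimal sketch. The helper names and the lowercase-key check are illustrative assumptions, not part of the OpenTelemetry SDK.

```python
import re

# Lowercase, dot-separated names such as "service.name" (an illustrative check).
_KEY_PATTERN = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)*$")


def build_resource_attributes(service_name, service_version, environment):
    """Return the three resource attributes used in this article."""
    return {
        "service.name": service_name,
        "service.version": service_version,
        "deployment.environment": environment,
    }


def to_env_value(attributes):
    """Render attributes as 'key=value,key=value', rejecting malformed keys."""
    for key in attributes:
        if not _KEY_PATTERN.match(key):
            raise ValueError(f"unexpected attribute name: {key!r}")
    return ",".join(f"{key}={value}" for key, value in attributes.items())


attributes = build_resource_attributes("favorite_otel_manual", "1.0.0", "dev")
print(to_env_value(attributes))
```

The same dictionary can be passed to Resource.create in manually instrumented code. The key check catches names that contain a capital letter before they reach the backend. Values that contain commas or equals signs would need escaping, which this sketch does not handle.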

Semantic conventions
In addition to adding the appropriate resource attributes to the code, it is important to follow the OpenTelemetry semantic conventions for the specific technologies and infrastructure your application is built on. For example, if no automatic instrumentation exists for the database client you use, you will have to manually instrument the calls to the database. In doing so, you should utilize the semantic conventions for database calls in OpenTelemetry.

Similarly, if you are trying to trace Kafka or RabbitMQ, you can follow the OpenTelemetry semantic conventions for messaging systems.

There are multiple semantic conventions across several areas and signal types that can be followed using OpenTelemetry — check out the details.

4. Use Batch or simple processing?
Using simple or batch processing depends on your specific observability requirements. The advantages of batch processing include improved efficiency and reduced network overhead. Batch processing allows you to process telemetry data in batches, enabling more efficient data handling and resource utilization. On the other hand, batch processing increases the lag time for telemetry data to appear in the backend, as the span processor waits until a batch fills up or a scheduled delay elapses before sending data to the backend.

With simple processing, you send your telemetry data as soon as the data is generated, resulting in real-time observability. However, you will need to prepare for higher network overhead and more resources required to process all the separate data transmissions.

Here is what we used to set this up in Python:

from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(exporter)
provider.add_span_processor(processor)

# Sets the global default tracer provider
trace.set_tracer_provider(provider)

Your observability goals and budgetary constraints are the deciding factors when choosing batch or simple processing. A hybrid approach, with different processors for different services, can also be implemented. If real-time insights are critical for an ecommerce application, for example, then simple processing would be the better approach. For other applications where real-time insights are not crucial, consider batch processing. Often, experimenting with both approaches and seeing how your observability backend handles the data is a fruitful exercise to home in on what approach works best for the business.
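The trade-off can be illustrated with a toy model that counts export calls. The classes below are hypothetical stand-ins written for this sketch, not the SDK's SimpleSpanProcessor or BatchSpanProcessor. In particular, the toy batch processor flushes only when its buffer is full or when flush is called, and it ignores the timer and queue limits a real batch processor has.

```python
class CountingExporter:
    """Stand-in for a network exporter: records the size of each export call."""

    def __init__(self):
        self.calls = []

    def export(self, spans):
        self.calls.append(len(spans))


class ToySimpleProcessor:
    """Exports every span on its own as soon as it ends."""

    def __init__(self, exporter):
        self.exporter = exporter

    def on_end(self, span):
        self.exporter.export([span])

    def flush(self):
        pass  # nothing is ever buffered


class ToyBatchProcessor:
    """Buffers ended spans and exports them once max_batch_size is reached."""

    def __init__(self, exporter, max_batch_size=512):
        self.exporter = exporter
        self.max_batch_size = max_batch_size
        self.buffer = []

    def on_end(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.exporter.export(self.buffer)
            self.buffer = []


def run(processor_cls, span_count=1000, **kwargs):
    exporter = CountingExporter()
    processor = processor_cls(exporter, **kwargs)
    for i in range(span_count):
        processor.on_end(f"span-{i}")
    processor.flush()  # send whatever is still buffered at shutdown
    return exporter.calls


simple_calls = run(ToySimpleProcessor)
batch_calls = run(ToyBatchProcessor, max_batch_size=512)
print("simple:", len(simple_calls), "export calls")
print("batch: ", len(batch_calls), "export calls, sizes", batch_calls)
```

For 1,000 spans the simple variant makes 1,000 export calls while the batch variant makes two. The second batch is only sent at the final flush, which is the extra latency described above.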

Use the OpenTelemetry Collector or go direct?

When starting out with OpenTelemetry, ingesting and transmitting telemetry data directly to a backend such as Elastic is a good way to get started. Often, you would be using the OTel direct method in the development phase and in a local environment.

However, as you deploy your applications to production, the applications become fully responsible for ingesting and sending telemetry data. The amount of data sent in a local environment or during development would be minuscule compared to a production environment. With millions or even billions of users interacting with your applications, the work of ingesting and sending telemetry data in addition to the core application functions can become resource-intensive. Thus, offloading the collection, processing, and exporting of telemetry data to a backend such as Elastic onto the vendor-agnostic OTel Collector would enable your applications to perform more efficiently, leading to a better customer experience.

Advantages of using the OpenTelemetry Collector

For cloud-native and microservices-based applications, the OpenTelemetry Collector provides the flexibility to handle multiple data formats and, more importantly, offloads the resources required from the application to manage telemetry data. The result: reduced application overhead and ease of management as the telemetry configuration can now be managed in one place.

The OTel Collector is the most common configuration because the OTel Collector is used:

To enrich the telemetry data with additional context information — for example, on Kubernetes, the OTel Collector would take the responsibility to enrich all the telemetry with the corresponding K8s pod and node information (labels, pod-name, etc.)
To provide uniform and consistent processing or transformation of telemetry data in a central place (i.e., the OTel Collector) rather than take on the burden of syncing configuration across hundreds of services to ensure consistent processing
To aggregate metrics across multiple instances of a service, which is only doable on the OTel Collector (not within individual SDKs/agents)

Key features of the OpenTelemetry Collector include:

Simple setup: The setup documentation is clear and comprehensive. We also have an example setup using Elastic and the OTel Collector documented in this blog.
Flexibility: The OTel Collector offers many configuration options and allows you to easily integrate into your existing observability solution. In addition, OpenTelemetry’s pre-built distributions allow you to start quickly and build the features that you need. Below is an example of the Kubernetes Deployment manifest that we used to run our collector for an application running on Kubernetes.

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otelcollector
spec:
  selector:
    matchLabels:
      app: otelcollector
  template:
    metadata:
      labels:
        app: otelcollector
    spec:
      serviceAccountName: default
      terminationGracePeriodSeconds: 5
      containers:
        - command:
            - "/otelcol"
            - "--config=/conf/otel-collector-config.yaml"
          image: otel/opentelemetry-collector:0.61.0
          name: otelcollector
          resources:
            limits:
              cpu: 1
              memory: 2Gi
            requests:
              cpu: 200m
              memory: 400Mi

Collect host metrics: Using the OTel Collector allows you to capture infrastructure metrics, including CPU, RAM, storage capacity, and more. This means you won’t need to install a separate infrastructure agent to collect host metrics. An example OTel configuration for ingesting host metrics is below.

receivers:
  hostmetrics:
    scrapers:
      cpu:
      disk:

Security: The OTel Collector operates in a secure manner by default. It can filter out sensitive information based on your configuration. OpenTelemetry provides these security guidelines to ensure your security needs are met.
Tail-based sampling for distributed tracing: With OpenTelemetry, you can specify the sampling strategy you would like to use for capturing traces. Tail-based sampling is available in the OTel Collector through its tail sampling processor, which ships in the contrib distribution. With tail-based sampling, you control and thereby reduce the amount of trace data collected. More importantly, you capture the most relevant traces, enabling you to spot issues within your microservices applications much faster.
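To make the idea of tail-based sampling concrete, here is a toy model of the decision in plain Python. It is not the Collector's implementation or configuration: the span dictionaries, the keep_trace function, and the 500 ms threshold are all hypothetical. The point is that the keep-or-drop decision is made per trace, after all of its spans have arrived, so traces with errors or high latency can be kept.

```python
from collections import defaultdict

# Hypothetical finished spans as they might arrive from several services.
spans = [
    {"trace_id": "t1", "name": "GET /favorites", "duration_ms": 40, "error": False},
    {"trace_id": "t1", "name": "redis.smembers", "duration_ms": 5, "error": False},
    {"trace_id": "t2", "name": "GET /favorites", "duration_ms": 900, "error": False},
    {"trace_id": "t3", "name": "POST /favorites", "duration_ms": 30, "error": False},
    {"trace_id": "t3", "name": "redis.sadd", "duration_ms": 12, "error": True},
]


def keep_trace(trace_spans, latency_threshold_ms=500):
    """Keep a trace if any span failed or any span was slower than the threshold."""
    return any(
        s["error"] or s["duration_ms"] > latency_threshold_ms for s in trace_spans
    )


# Group spans by trace first: the decision needs the complete trace.
traces = defaultdict(list)
for span in spans:
    traces[span["trace_id"]].append(span)

kept = sorted(
    trace_id for trace_id, trace_spans in traces.items() if keep_trace(trace_spans)
)
print("kept traces:", kept)
```

Here the fast, successful trace t1 is dropped while the slow trace t2 and the failing trace t3 are kept. A head-based sampler could not make this choice, because it decides before the outcome is known.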

What about logs?

OpenTelemetry’s approach to ingesting metrics and traces is a “clean-sheet design.” OTel developed a new API for metrics and traces and implementations for multiple languages. For logs, on the other hand, due to the broad adoption and existence of legacy log solutions and libraries, support from OTel is the least mature.

Today, OpenTelemetry’s solution for logs is to provide integration hooks to existing solutions. Longer term though, OpenTelemetry aims to incorporate context aggregation with logs thus easing logging correlation with metrics and traces. Learn more about OpenTelemetry’s vision.

Elastic has written up its recommendations in the following article: 3 models for logging with OpenTelemetry and Elastic. Here is a brief summary of what Elastic recommends:

Output logs from your service (alongside traces and metrics) using an embedded OpenTelemetry Instrumentation library to Elastic via the OTLP protocol.
Write logs from your service to a file scraped by the OpenTelemetry Collector, which then forwards to Elastic via the OTLP protocol.
Write logs from your service to a file scraped by Elastic Agent (or Filebeat), which then forwards to Elastic via an Elastic-defined protocol.

The third approach, where developers have their logs scraped using an Elastic Agent, is the recommended approach, as Elastic provides a widely adopted and proven method for capturing logs from applications and services using OTel. The first two approaches, although both use OTel instrumentation, are not yet mature and aren't ready for production-level applications.
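For the file-based models, log lines are easiest to correlate with traces when each line is structured and carries the trace and span identifiers. The following is a minimal sketch using only the standard library. The JsonLineFormatter class, the hard-coded service name, and the ID values are hypothetical. The field names follow ECS-style naming, and an in-memory stream stands in for the log file an agent would scrape.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Format each record as one JSON object per line with ECS-style field names."""

    def format(self, record):
        document = {
            "@timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "log.level": record.levelname.lower(),
            "message": record.getMessage(),
            "service.name": "favorite_otel_manual",
        }
        # Fields passed via `extra=` become attributes on the record.
        for field in ("trace.id", "span.id", "user.id"):
            if field in record.__dict__:
                document[field] = record.__dict__[field]
        return json.dumps(document)


stream = io.StringIO()  # stands in for the log file an agent would scrape
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonLineFormatter())
logger = logging.getLogger("favorite")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# The IDs below are made-up placeholders; a real service would read them
# from the active span's context.
logger.info("Getting favorites", extra={
    "trace.id": "0af7651916cd43dd8448eb211c80319c",
    "span.id": "b7ad6b7169203331",
    "user.id": "42",
})
print(stream.getvalue().strip())
```

Swapping the StreamHandler for a logging.FileHandler would write the same JSON lines to disk for an agent or collector to pick up.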

Get more details about the three approaches in this Elastic blog which includes a deep-dive discussion with hands-on implementation, architecture, advantages, and disadvantages.

It’s not all sunshine and roses

OpenTelemetry is definitely beneficial to obtaining observability for modern cloud-native distributed applications. Having a standardized framework for ingesting telemetry reduces operational expenses and allows the organization to focus more on application innovation. Even with all the advantages of using OTel, there are some limitations that you should be aware of as well.

But first, here are the advantages of using OpenTelemetry:

Standardized instrumentation: Having a consistent method for instrumenting systems up and down the stack gives organizations more operational efficiency and cost-effective observability.
Auto-instrumentation: OTel provides organizations with the ability to auto-instrument popular libraries and frameworks enabling them to quickly get up and running and requiring minimal changes to the codebase.
Vendor neutrality: Organizations don’t have to be tied to one vendor for their observability needs. In fact, they can use several of them, using OTel to try one out or have a more best-of-breed approach if desired.
Future-proof instrumentation: Since OpenTelemetry is open-source and has a vast ecosystem of support, your organization will be using technology that will be constantly innovated and can scale and grow with the business.

There are some limitations as well:

Instrumenting with OTel is a fork-lift upgrade. Organizations must be aware that time and effort need to be invested to migrate proprietary instrumentation to OpenTelemetry.
The language SDKs are at different maturity levels, so applications that rely on alpha, beta, or experimental functionality may not provide the organization with the full benefits in the short term.

Over time, the disadvantages will be reduced, especially as the maturity level of the functional components improves. Check the OpenTelemetry status page for updates on the status of the language SDKs, the collector, and overall specifications.

Using Elastic and migrating to OpenTelemetry at your speed

Transitioning to OpenTelemetry is a challenge for most organizations, as it requires retooling existing proprietary APM agents on almost all applications. This can be daunting, but OpenTelemetry agents provide a mechanism to avoid having to modify the source code, otherwise known as auto-instrumentation. With auto-instrumentation, the only code changes will be to rip out the proprietary APM agent code. Additionally, you should also ensure you have an observability tool that natively supports OTel without the need for additional agents, such as Elastic Observability.

Elastic recently donated Elastic Common Schema (ECS) in its entirety to OTel. The goal was to ensure OTel can get to a standardized logging format. ECS, developed by the Elastic community over the past few years, provides a vehicle to allow OTel to provide a more mature logging solution.

Elastic provides native OTel support. You can directly send OTel telemetry into Elastic Observability without the need for a collector or any sort of processing normally used in the collector.

Here are the configuration options in Elastic for OpenTelemetry:

Most of Elastic Observability’s APM capabilities are available with OTel data. Some of these include:

Service maps
Service details (latency, throughput, failed transactions)
Dependencies between services, distributed tracing
Transactions (traces)
Machine learning (ML) correlations
Log correlation

In addition to Elastic’s APM and a unified view of the telemetry data, you will also be able to use Elastic’s powerful machine learning capabilities to reduce analysis time, and alerting to help reduce MTTR.

Although OpenTelemetry supports many programming languages, its major functional components (metrics, traces, and logs) are still at various stages of maturity. Applications written in Java, Python, and JavaScript are therefore good choices to migrate first, as their metrics, traces, and logs (for Java) are stable.

For the other languages that are not yet supported, you can easily instrument those using Elastic Agents, therefore running your observability platform in mixed mode (Elastic Agents with OpenTelemetry agents).

We ran a variation of our standard Elastic Agent application with one service, the newsletter-otel service, flipped to OTel. We can convert each of the remaining services to OTel as needed and as development resources allow.

As a result, you can take advantage of the benefits of OpenTelemetry, which include:

Standardization: OpenTelemetry provides a standard approach to telemetry collection, enabling consistency of processes and easier integration of different components.
Vendor-agnostic: Because OpenTelemetry is open source and designed to be vendor-agnostic, DevOps and SRE teams can work with different monitoring and observability backends, which reduces vendor lock-in.
Flexibility and extensibility: With its flexible architecture and inherent design for extensibility, OpenTelemetry enables teams to create custom instrumentation and enrich their own telemetry data.
Community and support: OpenTelemetry has a growing community of contributors and adopters. In fact, Elastic contributed to developing a common schema for metrics, logs, traces, and security events. Learn more here.

Once the other languages reach a stable state, you can then continue your migration to OpenTelemetry agents.

Summary

OpenTelemetry has become the de facto standard for ingesting metrics, traces, and logs from cloud-native applications. It provides a vendor-agnostic framework for collecting telemetry data, enabling you to use the observability backend of your choice.

Auto-instrumentation using OpenTelemetry is the fastest way for you to ingest your telemetry data and is an optimal way to get started with OTel. However, using manual instrumentation provides more flexibility, so it is often the next step in gaining deeper insights from your telemetry data.

OpenTelemetry also allows you to send your data directly to a backend or through the OTel Collector. For local development, going direct is a great way to get your data to your observability backend; however, with production workloads, using the OTel Collector is recommended. The collector takes care of all the data ingestion and processing, enabling your applications to focus on functionality and not have to deal with any telemetry data tasks.

Logging functionality is still at a nascent stage with OpenTelemetry, while ingesting metrics and traces is well established. For logs, if you’ve started down the OTel path, you can send your logs to Elastic using the OTLP protocol. Since Elastic has a very mature logging solution, a better approach would be to use an Elastic Agent to ingest logs.

Although the long-term benefits are clear, organizations need to be aware that adopting OpenTelemetry means they would own their own instrumentation. Thus, appropriate resources and effort need to be incorporated in the development lifecycle. Over time, however, OpenTelemetry brings standardization to telemetry data ingestion, offering organizations vendor-choice, scalability, flexibility, and future-proofing of investments.
