Gaining new perspectives beyond logging: An introduction to application performance monitoring


Prioritize customer experience with APM and tracing

Enterprise software development and operations has become an interesting space. We have some incredibly powerful tools at our disposal, yet as an industry, we have failed to adopt many of the tools that could make our lives easier. One such underutilized tool is application performance monitoring (APM) and tracing, even though OpenTelemetry has made it possible to adopt with low friction.

Logging, however, is ubiquitous. Every software application has logs of some kind, and the default troubleshooting workflow (even today) is to start from the exceptions experienced by customers and systems, go to the logs, and work from there toward a solution.

There are various challenges with this, one of the main ones being that logs often do not give enough information to solve the problem. Many services today return ambiguous 500 errors with little or nothing to go on. What if there isn’t an error or a log entry at all, or the problem is that the system is simply slow? Logging alone cannot solve these problems, and it leaves users with half-broken systems and poor experiences. We’ve all been on the wrong side of this, and it can be incredibly frustrating.

The question I find myself asking is: why does the customer experience so often come second to errors? If the customer experience is a top priority, then a strategy should be in place to adopt tracing and APM and make them as important as logging. Teams should stop going to logs by default and thinking primarily in terms of logs, as many do today. This will also require some changes to mental models.

What’s the path to get there? That’s exactly what we will explore in this blog post. We will start by talking about supporting organizational changes, and then we’ll outline a recommended journey for moving from just logging to a fully integrated solution with logs, traces, and APM.

Cultivating a new monitoring mindset: How to drive APM and tracing adoption

To get teams to shift their troubleshooting mindset, what organizational changes need to be made? 

Initially, businesses should consider strategic priorities and goals that need to be shared broadly among the teams. One thing that can help drive this in a very large organization is to consider an entire product team devoted to observability, or a Center of Excellence (CoE), with its own roadmap and priorities.

This team (either virtual or permanent) should start with the customer in mind and work backward, starting with key questions like: What do I need to collect? What do I need to observe? How do I act? Once team members understand the answers to these questions, they can start to think about the technology decisions needed to drive those outcomes. 

From a tracing and APM perspective, the areas of greatest concern are the customer experience, service level objectives, and service level outcomes. From here, organizations can start to implement programs of work to continuously improve and share knowledge across teams. This will help to align teams around a common framework with shared goals.

In the next few sections, we will go through a four-step journey to help you maximize your success with APM and tracing. It covers the following key steps on the path to successful APM adoption:

  1. Ingest: What choices do you have to make to get tracing activated and start ingesting trace data into your observability tools?
  2. Integrate: How does tracing integrate with logs to enable full end-to-end observability, and what else beyond simple tracing can you utilize to get even better resolution on your data?
  3. Analytics and AIOps: Improve the customer experience and reduce the noise through machine learning.
  4. Scale and total cost of ownership: Roll out enterprise-wide tracing and adopt strategies to deal with data volume.

1. Ingest

Ingesting data for APM purposes generally involves “instrumenting” the application. In this section, we will explore methods for instrumenting applications, talk a little bit about sampling, and finally wrap up with a note on using common schemas for data representation.

Getting started with instrumentation

What options do we have for ingesting APM and trace data? There are many options we will discuss to help guide you, but first let's take a step back. APM has a deep history — in the very first implementations of APM, people were concerned mainly with timing methods, like this below:

[Image: timing methods]

Usually you had a configuration file to specify which methods you wanted to time, and the APM implementation would instrument the specified code with method timings.
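
To give a rough idea of that early style of APM, here is a minimal, hand-rolled sketch of method timing in Python; the decorator and the business function are purely illustrative, not any vendor's actual implementation.

import functools
import logging
import time

logger = logging.getLogger("apm.timing")

def timed(func):
    """Log how long a function takes, in the spirit of early APM method timing."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info("%s took %.2f ms", func.__qualname__, elapsed_ms)
    return wrapper

@timed
def checkout(cart_id):
    ...  # hypothetical business method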

From here things started to evolve, and one of the first capabilities added to APM was tracing.

For Java, it’s fairly trivial to implement a system to do this by using what's known as a Java agent. You just specify the -javaagent command-line argument, and the agent gets access to Java's instrumentation hooks so it can modify the bytecode of classes as they are loaded, allowing you to “wrap” specific methods with timing or tracing routines. So, auto-instrumenting Java was one of the first things that the original APM vendors did.

OpenTelemetry has agents like this, and most observability vendors that offer APM solutions have their own proprietary ways of doing this, often with more advanced and differing features from the open source tooling. 

Things have moved on since then, and Node.js and Python are now popular.

As a result, ways of auto-instrumenting these language runtimes have appeared, which mostly work by injecting the instrumentation libraries into the application before it starts up. OpenTelemetry has a way of doing this on Kubernetes with an Operator and sidecar here, which supports Python, Node.js, Java, and .NET.

The other alternative is to start adding APM and tracing API calls into your own code, which is not dissimilar to adding logging functionality. You may even wish to create an abstraction in your code to deal with this cross-cutting concern, although this is less of a problem now that there are open standards with which you can implement this. 

You can see an example of how to add OpenTelemetry spans and attributes to your code for manual instrumentation below and here.

from flask import Flask
import monitor  # Custom application module (not shown here)
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.requests import RequestsInstrumentor


# Service name is required for most backends
resource = Resource(attributes={
    SERVICE_NAME: "your-service-name"
})

# Export spans over OTLP/gRPC to the endpoint configured in the environment
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(
    endpoint=os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT'),
    headers="Authorization=Bearer%20" + os.getenv('OTEL_EXPORTER_OTLP_AUTH_HEADER')))

provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Auto-instrument outbound HTTP calls made with the requests library
RequestsInstrumentor().instrument()

# Initialize the Flask app
app = Flask(__name__)

@app.route("/completion")
@tracer.start_as_current_span("do_work")
def completion():
    # Annotate the current span with a custom attribute
    span = trace.get_current_span()
    if span:
        span.set_attribute("completion_count", 1)
    return "ok"

By implementing APM in this way, you could even eliminate the need to do any logging by storing all your required logging information within span attributes, exceptions, and metrics. The downside is that you can only do this with code that you own, so you will not be able to remove all logs this way.
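
For instance, information that would otherwise be written to a log file can be attached to the active span as attributes, events, and recorded exceptions. Below is a minimal sketch using the OpenTelemetry Python API; the order-processing function and attribute names are hypothetical.

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def charge_customer(order_id):
    ...  # hypothetical downstream call

def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)   # replaces a structured log field
        try:
            charge_customer(order_id)
            span.add_event("payment_authorized")   # replaces an informational log line
        except Exception as exc:
            span.record_exception(exc)             # replaces an error log entry
            span.set_status(Status(StatusCode.ERROR))
            raise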

Sampling

Many people don’t realize that APM can be an expensive process. It adds CPU and memory overhead to your applications, and although there is a lot of value to be had, there are certainly trade-offs to be made.

Should you sample everything 100% and eat the cost? Or should you think about an intelligent trade-off with fewer samples or even tail-based sampling, which many products commonly support? Here, we will talk about the two most common sampling techniques — head-based sampling and tail-based sampling — to help you decide.

Head-based sampling
In this approach, sampling decisions are made at the beginning of a trace, typically at the entry point of a service or application. A fixed rate of traces is sampled, and this decision propagates through all the services involved in a distributed trace.  

With head-based sampling, you can control the rate using a configuration, allowing you to control the percentage of requests that are sampled and reported to the APM server. For instance, a sampling rate of 0.5 means that only 50% of requests are sampled and sent to the server. This is useful for reducing the amount of collected data while still maintaining a representative sample of your application's performance.
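
For example, with the OpenTelemetry Python SDK a fixed sampling ratio can be set when the tracer provider is created. This is a minimal sketch assuming a 50% ratio, with downstream services honoring the decision made at the entry point.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 50% of new traces at the entry point; child spans follow the parent's decision
sampler = ParentBased(root=TraceIdRatioBased(0.5))
trace.set_tracer_provider(TracerProvider(sampler=sampler))

# Equivalent configuration via environment variables:
#   OTEL_TRACES_SAMPLER=parentbased_traceidratio
#   OTEL_TRACES_SAMPLER_ARG=0.5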

Tail-based sampling
Unlike head-based sampling, tail-based sampling makes sampling decisions after the entire trace has been completed. This allows for more intelligent sampling decisions based on the actual trace data, such as only reporting traces with errors or traces that exceed a certain latency threshold. 

We recommend tail-based sampling because it has the highest likelihood of reducing the noise and helping you focus on the most important issues. It also helps keep costs down on the data store side. A downside of tail-based sampling, however, is that every span must be captured and exported before the sampling decision can be made, so APM agents generate more data, which can use more CPU and memory on your application.

OpenTelemetry Semantic Conventions and Elastic Common Schema

OpenTelemetry prescribes Semantic Conventions, or Semantic Attributes, to establish uniform names for various operations and data types. Adhering to these conventions fosters standardization across codebases, libraries, and platforms, ultimately streamlining the monitoring process.

Creating OpenTelemetry spans for tracing is flexible, allowing implementers to annotate them with operation-specific attributes. These spans represent particular operations within and between systems, often involving widely recognized protocols like HTTP or database calls. To effectively represent and analyze a span in monitoring systems, supplementary information is necessary, contingent upon the protocol and operation type.
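
As a simple illustration, the OpenTelemetry Python packages expose these conventions as constants, so a span describing an HTTP call can use the standard attribute names rather than ad hoc ones. This is a sketch only; the span name and values are arbitrary, and newer releases expose the same constants under slightly different module paths.

from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("GET /api/products") as span:
    # Standard semantic convention attribute names for an HTTP operation
    span.set_attribute(SpanAttributes.HTTP_METHOD, "GET")
    span.set_attribute(SpanAttributes.HTTP_URL, "https://example.com/api/products")
    span.set_attribute(SpanAttributes.HTTP_STATUS_CODE, 200)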

Unifying attribution methods across different languages is essential for operators to easily correlate and cross-analyze telemetry from polyglot microservices without needing to grasp language-specific nuances. 

Elastic's recent contribution of the Elastic Common Schema to OpenTelemetry enhances Semantic Conventions to encompass logs and security. 

Abiding by a shared schema yields considerable benefits, enabling operators to rapidly identify intricate interactions and correlate logs, metrics, and traces, thereby expediting root cause analysis and reducing time spent searching for logs and pinpointing specific time frames.

We advocate for adhering to established schemas such as ECS when defining trace, metrics, and log data in your applications, particularly when developing new code. This practice will conserve time and effort when addressing issues.

2. Integrate

Integrations are very important for APM. How well your solution can integrate with other tools and technologies such as cloud, as well as its ability to integrate logs and metrics into your tracing data, is critical to fully understand the customer experience. In addition, most APM vendors have adjacent solutions for synthetic monitoring and profiling to gain deeper perspectives to supercharge your APM. We will explore these topics in the following section.

APM + logs = superpowers!

Because APM agents can instrument code, they can also instrument code that is being used for logging. This way, you can capture log lines directly within APM. This is normally simple to enable.

With this enabled, you will also get automated injection of useful fields like these:

  • service.name, service.version, service.environment
  • trace.id, transaction.id, error.id

This means log messages will be automatically correlated with transactions as shown below, making it far easier to reduce mean time to resolution (MTTR) and find the needle in the haystack:

[Image: latency distribution]

If this is available to you, we highly recommend turning it on.
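
As an example of what “turning it on” can look like, the OpenTelemetry Python logging instrumentation stamps the active trace and span IDs onto every log record. This is a minimal sketch, and the exact mechanism and field names vary by agent and vendor.

import logging

from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Injects otelTraceID, otelSpanID, and otelServiceName into log records and,
# with set_logging_format=True, rewrites the default log format to include them
LoggingInstrumentor().instrument(set_logging_format=True)

logging.getLogger(__name__).warning("payment failed")  # now correlated with the active trace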

Deploying APM inside Kubernetes

It is common for people to want to deploy APM inside a Kubernetes environment, and tracing is critical for monitoring applications in cloud-native environments. There are three different ways you can tackle this. 

1. Auto instrumentation using sidecars
With Kubernetes, it is possible to use an init container and something that will modify Kubernetes manifests on the fly to auto instrument your applications. 

The init container is used simply to copy the required library or jar file that you need into the main application container at startup. Then, you can use Kustomize to add the required command line arguments to bootstrap your agents.

If you are not familiar with it, Kustomize adds, removes, or modifies Kubernetes manifests on the fly. It is even built into the Kubernetes CLI — simply execute kubectl apply -k.

OpenTelemetry has an operator that does all this for you automatically (without the need for Kustomize) for Java, .NET, Python, and Node.js, and many vendors also have their own operator or Helm charts that can achieve the same result.

2. Baking APM into containers or code
A second option for deploying APM in Kubernetes — and indeed any containerized environment — is to bake the APM agents and configuration into the image via a Dockerfile.

Have a look at an example here using the OpenTelemetry Java Agent:

# Use the official OpenJDK image as the base image
FROM openjdk:11-jre-slim

# Set up environment variables
ENV APP_HOME /app
ENV OTEL_VERSION 1.7.0-alpha
ENV OTEL_JAVAAGENT_URL https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v${OTEL_VERSION}/opentelemetry-javaagent-${OTEL_VERSION}-all.jar

# Create the application directory
RUN mkdir $APP_HOME
WORKDIR $APP_HOME

# Download the OpenTelemetry Java agent
ADD ${OTEL_JAVAAGENT_URL} /otel-javaagent.jar

# Add your Java application JAR file
COPY your-java-app.jar $APP_HOME/your-java-app.jar

# Expose the application port (e.g. 8080)
EXPOSE 8080

# Configure the OpenTelemetry Java agent and run the application
CMD java -javaagent:/otel-javaagent.jar \
      -Dotel.resource.attributes=service.name=your-service-name \
      -Dotel.exporter.otlp.endpoint=your-otlp-endpoint:4317 \
      -Dotel.exporter.otlp.insecure=true \
      -jar your-java-app.jar

3. Tracing using a service mesh (Envoy/Istio)
The final option you have here is if you are using a service mesh. A service mesh is a dedicated infrastructure layer for handling service-to-service communication in a microservices architecture. It provides a transparent, scalable, and efficient way to manage and control the communication between services, enabling developers to focus on building application features without worrying about inter-service communication complexities.

The great thing about this is that we can activate tracing within the proxy and therefore get visibility into requests between services. We don’t have to change any code or even run APM agents for this; we simply enable the tracing support built into the proxy (which can export in OpenTelemetry format), making this likely the lowest-overhead option. Learn more about this option.

Synthetics and universal profiling

Most APM vendors have add-ons to the primary APM use cases. Typically, we see synthetics and continuous profiling being added to APM solutions. APM can integrate with both, and there is real value in bringing these technologies together to gain even more insight into issues.

Synthetics
Synthetic monitoring is a method used to measure the performance, availability, and reliability of web applications, websites, and APIs by simulating user interactions and traffic. It involves creating scripts or automated tests that mimic real user behavior, such as navigating through pages, filling out forms, or clicking buttons, and then running these tests periodically from different locations and devices.

This gives Development and Operations teams the ability to spot problems far earlier than they might otherwise, catching issues before real users do in many cases. 

Synthetics can be integrated with APM by injecting an APM agent into the website when the script runs, so even if you didn’t build end user monitoring into your website initially, it can be injected at run time. This usually happens without any input from the user. From there, a trace ID for each request can be passed down through the various layers of the system, allowing teams to follow the request all the way from the synthetics script to the lowest levels of the application stack, such as the database.


Universal profiling
“Profiling” is a dynamic method of analyzing the complexity of a program, such as CPU utilization or the frequency and duration of function calls. With profiling, you can locate exactly which parts of your application are consuming the most resources. “Continuous profiling” is a more powerful version of profiling that adds the dimension of time. By understanding your system’s resources over time, you can then locate, debug, and fix issues related to performance.

Universal profiling is a further extension of this, which allows you to capture profile information about all of the code running in your system all the time. Using a technology like eBPF can allow you to see all the function calls in your systems, including into things like the Kubernetes runtime. Doing this gives you the ability to finally see unknown unknowns — things you didn’t know were problems. This is very different from APM, which is really about tracking individual traces and requests and the overall customer experience. Universal profiling is about overcoming those issues you didn’t even know existed and even answering the question “What is my most expensive line of code?”

Universal profiling can be linked into APM, showing you profiles that occurred during a specific customer issue, for example, or by linking profiles directly to traces by looking at the global state that exists at the thread level. These technologies can work wonders when used together. 

Typically, profiles are viewed as “flame graphs,” as shown below. The boxes represent the amount of “on-CPU” time spent executing a particular function.

[Image: universal profiling flame graph]

3. Analytics and AIOps

The interesting thing about APM is it opens up a whole new world of analytics versus just logs. All of a sudden, you have access to the information flows from inside applications. 

This allows you to easily capture things like the amount of money a specific customer is currently spending on your most critical ecommerce store, or to look at failed trades in a brokerage app and see how much revenue those failures are costing. You can even apply machine learning algorithms to project future spend or spot anomalies in this data, giving you a new window into how your business runs.

In this section, we will look at ways to do this and how to get the most out of this new world, as well as how to apply AIOps practices to this new data. We will also discuss setting up SLIs and SLOs for APM data.

Getting business data into your traces

There are generally two ways of getting business data into your traces. You can modify code and add in Span attributes, an example of which is available here and shown below. Or you can write an extension or a plugin, which has the benefit of avoiding code changes. OpenTelemetry supports adding extensions in its auto-instrumentation agents. Most other APM vendors usually have something similar.

import json
from functools import wraps

from opentelemetry import trace

# Note: `counters` and `calculate_cost` are assumed to be defined elsewhere in the application
def count_completion_requests_and_tokens(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        counters['completion_count'] += 1
        response = func(*args, **kwargs)

        # Pull token usage details out of the completion response
        token_count = response.usage.total_tokens
        prompt_tokens = response.usage.prompt_tokens
        completion_tokens = response.usage.completion_tokens
        cost = calculate_cost(response)
        strResponse = json.dumps(response)

        # Set OpenTelemetry span attributes carrying the business data
        span = trace.get_current_span()
        if span:
            span.set_attribute("completion_count", counters['completion_count'])
            span.set_attribute("token_count", token_count)
            span.set_attribute("prompt_tokens", prompt_tokens)
            span.set_attribute("completion_tokens", completion_tokens)
            span.set_attribute("model", response.model)
            span.set_attribute("cost", cost)
            span.set_attribute("response", strResponse)
        return response
    return wrapper

Using business data for fun and profit

Once you have the business data in your traces, you can start to have some fun with it. Take a look at the example below for a financial services fraud team. Here we are tracking the average transaction value for our larger business customers. Crucially, we can see if there are any unusual transactions.

[Image: customer count]

A lot of this is powered by machine learning, which can classify transactions or do anomaly detection. Once you start capturing the data, it is possible to do a lot of useful things like this, and with a flexible platform, integrating machine learning models into this process becomes a breeze.

[Image: fraud detection over a 12-hour window]

SLIs and SLOs

Service level indicators (SLIs) and service level objectives (SLOs) serve as critical components for maintaining and enhancing application performance. SLIs, which represent key performance metrics such as latency, error rate, and throughput, help quantify an application's performance, while SLOs establish target performance levels to meet user expectations. 

By selecting relevant SLIs and setting achievable SLOs, organizations can better monitor their application's performance using APM tools. Continually evaluating and adjusting SLIs and SLOs in response to changes in application requirements, user expectations, or the competitive landscape ensures that the application remains competitive and delivers an exceptional user experience.

In order to define and track SLIs and SLOs, APM becomes a critical perspective that is needed for understanding the user experience. Once APM is implemented, we recommend that organizations perform the following steps.

  • Define the SLOs and the SLIs required to track them.
  • Define SLO budgets and how they are calculated, reflecting the business’s perspective and setting realistic targets (a simple budget calculation is sketched after this list).
  • Define SLIs to be measured from a user experience perspective.
  • Define alerting and paging rules: page only on customer-facing SLO degradations, record symptomatic alerts, and notify on critical symptomatic alerts.
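
To make the budget step concrete, here is a small sketch of how an availability SLI and the consumed error budget might be calculated over a reporting window; the request counts and the 99.9% target are made-up numbers.

# Hypothetical figures for a 30-day window
slo_target = 0.999            # 99.9% of requests should succeed
total_requests = 12_500_000
failed_requests = 9_300

sli = (total_requests - failed_requests) / total_requests   # observed availability
error_budget = (1 - slo_target) * total_requests            # failures we can "afford"
budget_consumed = failed_requests / error_budget            # fraction of the budget burned

print(f"SLI: {sli:.4%}, error budget consumed: {budget_consumed:.0%}")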

Synthetic monitoring and end user monitoring (EUM) can also provide more of the data required to understand latency, throughput, and error rate from the user’s perspective, which is exactly where good, business-focused metrics and data are most important.

4. Scale and total cost of ownership

With increased perspectives, customers often run into scalability and total cost of ownership issues. All this new data can be overwhelming. Luckily, there are various techniques you can use to deal with this. Tracing itself can actually help with volume challenges, because you can decompose unstructured logs and fold that information into traces, which leads to additional efficiency. You can also use the sampling methods we discussed earlier to deal with scale challenges.

In addition to this, for large enterprise scale, we can use streaming pipelines like Kafka or Pulsar to manage the data volumes. This has an additional benefit that you get for free: if you take down the systems consuming the data or they face outages, it is less likely you will lose data. 

With this configuration in place, your “Observability pipeline” architecture would look like this:

[Image: observability pipeline with OpenTelemetry Collector]

This completely decouples your sources of data from your chosen observability solution, which future-proofs your observability stack, enables you to reach massive scale, and makes you less reliant on specific vendor code for data collection.

Another thing we recommend is being intelligent about instrumentation. This has two benefits: you get some CPU cycles back in the instrumented application, and your backend data collection systems have less data to process. If you know, for example, that you have no interest in tracking calls to a specific endpoint, you can exclude those classes and methods from instrumentation.
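
As an illustration, the OpenTelemetry Flask instrumentation for Python accepts a list of URL patterns to exclude, so no spans are created for them. The endpoint names below are hypothetical, and most agents offer comparable settings (the OpenTelemetry Java agent, for instance, lets you disable individual instrumentation modules through configuration).

from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)

# Skip noisy, low-value endpoints such as health checks and metrics scrapes
FlaskInstrumentor().instrument_app(app, excluded_urls="healthz,metrics")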

And finally, data tiering is a transformative approach for managing data storage that can significantly reduce the total cost of ownership (TCO) for businesses. Primarily, it allows organizations to store data across different types of storage mediums based on their accessibility needs and the value of the data. For instance, frequently accessed, high-value data can be stored in expensive, high-speed storage, while less frequently accessed, lower-value data can be stored in cheaper, slower storage. 

This approach, often incorporated in cloud storage solutions, enables cost optimization by ensuring that businesses only pay for the storage they need at any given time. Furthermore, it provides the flexibility to scale up or down based on demand, eliminating the need for large capital expenditures on storage infrastructure. This scalability also reduces the need for costly over-provisioning to handle potential future demand.

Conclusion

In today's highly competitive and fast-paced software development landscape, simply relying on logging is no longer sufficient to ensure top-notch customer experiences. By adopting APM and distributed tracing, organizations can gain deeper insights into their systems, proactively detect and resolve issues, and maintain a robust user experience.

In this blog, we have explored the journey of moving from a logging-only approach to a comprehensive observability strategy that integrates logs, traces, and APM. We discussed the importance of cultivating a new monitoring mindset that prioritizes customer experience, and the necessary organizational changes required to drive APM and tracing adoption. We also delved into the various stages of the journey, including data ingestion, integration, analytics, and scaling.

By understanding and implementing these concepts, organizations can optimize their monitoring efforts, reduce MTTR, and keep their customers satisfied. Ultimately, prioritizing customer experience through APM and tracing can lead to a more successful and resilient enterprise in today's challenging environment.

Learn more about APM at Elastic.