
From distributed tracing to distributed profiling with Elastic APM

Distributed tracing is great — it helps you identify which (micro)services in a complex architecture have issues that interfere with the user experience, such as high latency or errors.

But once a problematic service is identified, it can be difficult to find out which methods are to blame for the slowdown. Well, we have some big news for our Elastic APM users in the Java ecosystem: we have added support for inferred spans by leveraging async-profiler, giving you visibility into which methods are taking the most time within your traces.

Recap: What is tracing?

Before we dig deeper into our new inferred spans functionality, let's take a step back and look at how tracing usually works. Typically, tracing tools record your application's incoming and outgoing requests (such as calls to downstream services and databases). Among other things, they record metadata like the duration, the URL, and HTTP headers. The recording of an incoming request is called a transaction. Other tracing systems may refer to them as entry spans.
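
For entry points that the agent does not instrument automatically, the public API also lets you record a transaction yourself. Here is a minimal sketch of that; the OrderHandler class and the transaction name are hypothetical examples.

    // A minimal sketch of recording a transaction by hand with the agent's
    // public API (co.elastic.apm.api). OrderHandler and the transaction name
    // are hypothetical examples.
    import co.elastic.apm.api.ElasticApm;
    import co.elastic.apm.api.Scope;
    import co.elastic.apm.api.Transaction;

    public class OrderHandler {

        public void handleIncomingRequest() {
            Transaction transaction = ElasticApm.startTransaction();
            try (Scope scope = transaction.activate()) {
                transaction.setName("GET /orders");
                transaction.setType(Transaction.TYPE_REQUEST);
                // ... handle the request; activating the transaction associates
                // it with the current thread ...
            } finally {
                transaction.end();
            }
        }
    }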

Tracing outgoing requests (called spans or exit spans) serves two purposes:

  • To propagate a unique trace identifier to downstream services so that they can participate in the same trace. This is important to get the full picture of which services were involved in serving an end-user-initiated request (see the sketch after this list).
  • To gain insight into the requests to external services or databases, as they are usually the number one cause of latency, especially in microservice architectures.
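
To make the first point a bit more concrete, here is a rough sketch of how a trace identifier travels to a downstream service via an HTTP header. In practice, the Elastic APM agent adds this header automatically for supported HTTP clients; the URL and the ID values below are made up, and the header shown follows the W3C Trace Context convention.

    // Illustrative sketch only: the agent propagates trace context automatically
    // for supported HTTP clients. The header value follows the W3C Trace Context
    // format (00-<trace-id>-<parent-span-id>-<flags>); all values are examples.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class TracePropagationSketch {
        public static void main(String[] args) throws Exception {
            String traceId = "0af7651916cd43dd8448eb211c80319c"; // example 128-bit trace ID
            String parentSpanId = "b7ad6b7169203331";            // example 64-bit span ID

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://downstream.example.com/orders"))
                    // The downstream service reads this header and joins the same trace.
                    .header("traceparent", "00-" + traceId + "-" + parentSpanId + "-01")
                    .GET()
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }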

To learn more about tracing, check out the Distributed Tracing, OpenTracing and Elastic APM blog.

What if the latency is not caused by making external requests?

If your requests are slow because of heavy computations, inefficient algorithms, or similar problems not related to calling other services, you are usually out of luck with what most APM/tracing tools show you out of the box.

But with the Elastic APM Java agent there are multiple ways to get additional method-level insights.

Create spans manually

You can use either our public agent API or the OpenTracing API to manually create spans for methods of interest. If you want a bit more convenience, there are also annotations, such as @Traced, that you can place on top of your methods to automatically create a span for them. These options have the same drawback — you have to know in advance which methods you want to monitor. If you have no clue which methods might be slow, this can turn into a painful trial-and-recompile-and-deploy-and-error cycle.
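
To illustrate both options, here is a minimal sketch using the agent's public API (co.elastic.apm.api); the ReportGenerator class and its methods are hypothetical stand-ins for your own code.

    // A sketch of creating spans by hand, either with the @Traced annotation or
    // with the public API. ReportGenerator and its methods are hypothetical.
    import co.elastic.apm.api.ElasticApm;
    import co.elastic.apm.api.Span;
    import co.elastic.apm.api.Traced;

    public class ReportGenerator {

        // Annotation-based: the agent creates a span (or a transaction, if none
        // is active) for every call to this method.
        @Traced
        public void aggregateMonthlyFigures() {
            // ... the computation you already suspect to be slow ...
        }

        // API-based: full control over the span's name, type, and error capturing.
        public void renderPdf() {
            Span span = ElasticApm.currentSpan().startSpan("app", "report", "render");
            try {
                span.setName("ReportGenerator#renderPdf");
                // ... the actual rendering work ...
            } catch (RuntimeException e) {
                span.captureException(e);
                throw e;
            } finally {
                span.end();
            }
        }
    }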

Configure which methods to instrument

Since version 1.3 of the Java agent, there's also the trace_methods configuration option. It lets you define which methods you want to instrument with a pointcut-like syntax. The advantage is that no code changes are required and that you can also instrument third-party code with it.

But with great power comes great responsibility: it's not rare to see people trying to instrument every method in a codebase with a wildcard trace_methods expression like com.example.*. While this is a convenient way to monitor all your methods at once, it can have a disastrous effect on performance. In some cases, it might seem to work fine in a development environment but fail miserably once load is applied. Don't get me wrong, instrumenting a single method is relatively lightweight and adds no more than about a microsecond of latency per method invocation. But if you instrument every single method in a codebase — including those that normally execute in less than a microsecond and are executed very frequently (for example, in a tight loop) — the overhead quickly adds up.

On top of the runtime CPU overhead, the agent has to create, serialize, and send each of those spans, which then get stored in the Elasticsearch span index, adding network, storage, and indexing overhead. Finally, the trace details views are not optimized to show traces with thousands of spans.
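
If you do use trace_methods, a narrowly scoped expression is the way to go. The sketch below shows one way to set it programmatically via the attach API (apm-agent-attach); the same key can also be set in elasticapm.properties or as a -Delastic.apm.trace_methods system property. The class and method names in the pointcut are hypothetical.

    // A sketch of attaching the agent with a narrowly scoped trace_methods
    // expression. The service name and the pointcut are hypothetical examples.
    import co.elastic.apm.attach.ElasticApmAttacher;
    import java.util.Map;

    public class AgentBootstrap {
        public static void main(String[] args) {
            ElasticApmAttacher.attach(Map.of(
                    "service_name", "checkout-service",
                    // Instrument only the calculation methods we actually care about,
                    // instead of a wildcard like com.example.*.
                    "trace_methods", "com.example.checkout.PriceCalculator#calculate*"
            ));
            // ... start the application ...
        }
    }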

If you need visibility into a large number of methods, we recommend the new approach based on profiler-inferred spans: it keeps the overhead down while still giving you additional visibility into your services. If you have used trace_methods for this purpose before, consider switching to profiler-inferred spans.

Create spans based on a sampling profiler

With version 1.14.0 of the Java agent, we have released an experimental version of a totally different approach to this problem. We have integrated the popular open source profiler async-profiler, a sampling (aka statistical) profiler, to give you additional visibility into your methods. The overhead of this new approach is extremely low because it does not insert measurement points into every method. Instead, async-profiler periodically requests the stack trace of all threads that are actively running a transaction. That is similar to a movie camera taking frequent snapshots of what is happening instead of recording every single event: it puts little overhead on the system, yet lets you understand what happened, just as a movie does.

We then correlate the stack traces gathered by async-profiler with span activation events. An activation event consists of the ID of a span, the thread it was active on, and the time frame during which it was active. Based on that, we create profiler-inferred spans for the slow methods, so the agent can show what is being executed in the gaps between the regular transactions and spans.
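
The following is a much-simplified illustration of that correlation idea, not the agent's actual implementation; all types and names are made up for the sketch. A sampled stack trace is attributed to whichever span was active on the same thread at the moment the sample was taken, and methods that show up in many consecutive samples become inferred spans.

    // Illustration only: attribute each stack trace sample to the span that was
    // active on the same thread during the sample's timestamp. The real agent
    // works differently; these types exist only for the sketch.
    import java.util.List;

    public class InferredSpanSketch {

        record ActivationEvent(String spanId, long threadId, long startNanos, long endNanos) {}
        record StackSample(long threadId, long timestampNanos, List<String> frames) {}

        static String owningSpan(StackSample sample, List<ActivationEvent> activations) {
            for (ActivationEvent activation : activations) {
                boolean sameThread = activation.threadId() == sample.threadId();
                boolean inWindow = sample.timestampNanos() >= activation.startNanos()
                        && sample.timestampNanos() <= activation.endNanos();
                if (sameThread && inWindow) {
                    return activation.spanId(); // the sample happened while this span was active
                }
            }
            return null; // the sample does not belong to any traced span
        }
    }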

In this screenshot, we can clearly see that the culprit of the latency is the WelcomeController#think method, reported to take close to 5 seconds:

When we look at the implementation of the method, we can see that it just calls Thread.sleep(5000), which makes the current thread wait for 5 seconds.
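
Reconstructed from that description, the method boils down to something like the following; the exact signature in the demo application may differ.

    // All of the latency comes from this single call.
    public void think() throws InterruptedException {
        Thread.sleep(5000);
    }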

Without inferred spans, we would only be able to see the transaction (WelcomeController#welcome) and the View#render span. Good luck finding the root cause in this case!

This is of course just a trivial example, but it shows how this enables you to easily spot slow methods.

Advantages of profiler-based inferred spans

There are two main advantages of using the profiler-based inferred spans approach to add visibility into additional methods:

  • The profiler-based approach is designed to be low-overhead enough to run in production. It can thus continuously run and provide insights into slow methods.
  • It is able to catch all slow methods without configuration, which means you don't have to specify any specific method names up front. The profiler automatically bubbles up these slow methods as spans in your distributed tracing UI.

When do I need profiler-inferred spans?

Having profiler-inferred spans directly in the waterfall view lets you analyze the performance of a particular instance of a transaction. This is especially useful for these use cases:

  • During development: When trying to find out why the request you just made was slow.
  • During load testing or in production: When analyzing why some requests are slower than others.
  • For customer support: When a user complains that a particular request they made at noon was slow, especially if you can't reproduce that slowness in your development or staging environment.

Our profiler secret sauce

Not all sampling profilers work in the same way. While the underlying mechanism of taking snapshots of a thread's stack trace is similar, exactly how the stack trace is taken makes a big difference in accuracy and overhead. With async-profiler, it is possible to enable profiler-inferred spans in production without causing noticeable response time overhead.

Other ways to get the stack trace of Java threads, such as ThreadMXBean#getThreadInfo, ThreadMXBean#dumpAllThreads, Thread#getStackTrace(), and the JVMTI method GetStackTrace, all share the same set of problems: they take stack traces at a global JVM safepoint. If you want to learn more about what that means, I recommend this blog post on sampling Java profilers.
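
For contrast, here is a minimal sketch of what a naive, safepoint-based sampler built on ThreadMXBean#dumpAllThreads looks like. Every sample forces all Java threads to a global safepoint, which is exactly the source of the problems listed below; the sampling interval and the printed output are arbitrary.

    // A naive, safepoint-based sampler: every dumpAllThreads call brings all
    // Java threads to a global safepoint. Shown only for contrast with the
    // AsyncGetCallTrace approach used by async-profiler.
    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class SafepointSamplerSketch {
        public static void main(String[] args) {
            ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

            scheduler.scheduleAtFixedRate(() -> {
                // Forces a global safepoint on every sample.
                ThreadInfo[] threads = threadMXBean.dumpAllThreads(false, false);
                for (ThreadInfo thread : threads) {
                    System.out.println(thread.getThreadName() + ": "
                            + thread.getStackTrace().length + " frames");
                }
            }, 0, 50, TimeUnit.MILLISECONDS);
        }
    }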

Ultimately, the disadvantages of safepoint sampling profilers are twofold:

  • The resulting profile is inaccurate due to safepoint bias.
  • It increases the stop-the-world pauses of the JVM. That is as bad as it sounds: during a stop-the-world pause, all Java threads freeze and the application can make no progress. This happens because the JVM has to wait until all threads have reached a safepoint before it can take the thread dump.

Async-profiler, on the other hand, uses the AsyncGetCallTrace method, which doesn't require the JVM to be at a safepoint when requesting the stack trace of a given thread. This method has several quirks and is not easy to work with; however, async-profiler has implemented several fallback strategies for getting the stack trace even when AsyncGetCallTrace can't walk the stack.

Tuning options

By default, the profiler is turned off and you need to enable it by setting profiling_inferred_spans_enabled to true. In initial load tests, we did not see a significant increase in latency with profiler-inferred spans enabled.

The accuracy of the inferred spans can be adjusted with the profiling_inferred_spans_sampling_interval configuration option. This works similarly to setting the frames per second of a movie camera. By default, it's set to 50ms. The lower the sampling interval, the higher the accuracy and level of detail of the inferred spans. Of course, the higher the level of detail, the higher the profiler overhead and the larger the Elasticsearch indices. As most of the processing is done in the background, the impact on the response time of user requests is negligible.

If you want high accuracy, but also want to avoid cluttering the UI with many fast-executing methods, consider setting profiling_inferred_spans_min_duration. It filters out inferred spans that are faster than the configured threshold.

There's also a set of options you can use to include or to exclude specific classes. By default, the classes from the JDK and from most application servers are excluded. This reduces the number of uninteresting inferred spans. Generally, the fewer classes that are included, the faster and the more memory efficient the processing is.
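
Putting it together, here is one way the options discussed above might be combined, again using the attach API; the same keys can also be set in elasticapm.properties or as -Delastic.apm.* system properties. The values are examples only, and since the feature is experimental, option names and defaults may change, so check the agent documentation for your version.

    // Example values only; verify option names and defaults against the agent
    // documentation for your version.
    import co.elastic.apm.attach.ElasticApmAttacher;
    import java.util.Map;

    public class ProfilingConfigSketch {
        public static void main(String[] args) {
            ElasticApmAttacher.attach(Map.of(
                    "profiling_inferred_spans_enabled", "true",
                    // Sample every 10ms instead of the default 50ms for more detail.
                    "profiling_inferred_spans_sampling_interval", "10ms",
                    // Drop inferred spans shorter than 250ms to keep the UI tidy.
                    "profiling_inferred_spans_min_duration", "250ms"
            ));
            // Options for including or excluding specific classes also exist;
            // see the agent docs for the exact keys.
            // ... start the application ...
        }
    }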

Note that because this is an experimental feature, its default options might change before it is generally available.

Caveats

As this approach is based on sampling, the inferred spans are not exact measurements but estimations. In general, the inferred spans might start a little bit after the method actually started and end before the method actually ended. That's why the duration of the WelcomeController#think span in the screenshot above is 16ms shy of 5 seconds. This one detail can lead to several inconsistencies, which are all documented in the Java agent README.

As inferred spans are created only after the profiling session ends, they may appear in the UI a little later than the regular spans. Because of that, a regular span can't currently be a child of an inferred span: the regular span may already have been indexed in Elasticsearch, with its parent.id pointing to another regular span or transaction. The good news is that we're working on a solution for this.

We're also planning to improve the UI support for inferred spans. You can check the current status in the Kibana issues.

But where are the flamegraphs?

This is only the beginning of our integration with async-profiler. We plan to integrate the CPU, allocation, and lock profiler so that you can view the profiles of multiple instances of your application at once via a flamegraph in the APM UI. This will become a great tool to analyze where your services spend their resources in general. Stay tuned to hear further updates on this exciting feature!

Try it out in your own system or in a 14-day free trial of Elasticsearch Service, and let us know what you think on Discuss. To get started, refer to the documentation to see how to set up the agent and how to enable inferred spans.