Engineering

A cookbook for contributing a plugin to the Elastic APM Java agent

Ideally, an APM agent would automatically instrument and trace any framework and library known to exist. In reality, what APM agents support reflect a combination of capacity and prioritization. Our list of supported technologies and frameworks is constantly growing according to prioritization based on our valued users' input. Still, if you are using Elastic APM Java agent and miss something not supported out of the box, there are several ways you can get it traced.

For example, you can use our public API to trace your own code, and our awesome custom-method-tracing configuration for basic monitoring of specific methods in third-party libraries. However, if you want to get extended visibility to specific data from third party code, you may need to do a bit more. Fortunately, our agent is open source, so you can do everything we can do. And while you’re at it, why not share it with the community? A big advantage of that is getting wider feedback and have you code running on additional environments.

We will gratefully welcome any contribution that extends our capabilities, as long as it meets several standards we must enforce, just as our users expect of us. For example check out this PR for supporting OkHttp client calls or this extension to our JAX-RS support. So, before hitting the keyboard and start coding, here are some things to keep in mind when contributing to our code base, presented alongside a test case that will assist with this plugin implementation guide.

Test case: Instrumenting Elasticsearch Java REST client

Before releasing our agent, we wanted to support our own datastore client. We wanted Elasticsearch Java REST client users to know:

  1. That a query to Elasticsearch occurred
  2. How long this query took
  3. Which Elasticsearch node responded the query request
  4. Some info about the query result, like status code
  5. When an error occurred
  6. The query itself for _search operations

We also made the decision to only support sync queries as first step, delaying the async ones until we have a proper infrastructure in place.

I extracted the relevant code, uploaded it to a gist and referenced it throughout the post. Note that although it is not the actual code you would find in our GitHub repo, it is entirely functional and relevant.

Java agent specific aspects

When writing Java agent code, there are certain special considerations to make. Let’s try to go over them briefly before examining our test case.

Bytecode instrumentation

Don’t worry, you will not need to write anything in bytecode, we use the magical Byte Buddy library (that, in turn, relies on ASM) for that. For example, the annotations we use to say what to inject at the beginning and at the end of the instrumented method. You just need to keep in mind that some of the code you write is not really going to be executed where you write it, but rather injected as compiled bytecode into someone else’s code (which is a huge benefit of openness — you can see exactly what code is getting injected).

An example for Byte Buddy directives for bytecode injection

Class visibility

This may be the most elusive factor and where most pitfalls hide. One needs to be very aware of where each part of the code is going to be loaded from and what can be assumed to be available for it in runtime. When adding a plugin, your code will be loaded in at least two distinct locations — one in the context of the instrumented library/application, the other is the context of the core agent code. For example, we have a dependency on HttpEntity, an Apache HTTP client class that comes with the Elasticsearch client. Since this code is injected into one of the client’s classes, we know this dependency is valid. On the other hand, when using IOUtils (a core agent class), we cannot assume any dependency other than core Java and core agent. If you are not familiar with Java class loading concepts, it may be useful to get at least a rough idea about it (for example, reading this nice overview).

Overhead

Well, you say, performance is always a consideration. Nobody wants to write inefficient code. However, when writing agent code, we don’t have the right to make the usual overhead tradeoff decisions you normally make when writing code. We need to be lean in all aspects. We are guests at someone else’s party and we are expected to do our job seamlessly.

For more in-depth overview about agent performance overhead and ways of tuning it, check out this interesting blog post.

Concurrency

Normally, the first tracing operation of each event will be executed on the request-handling thread, one of many threads in a pool. We need to do as little as possible on this thread and do it fast, releasing it to handle more important business. Byproducts of these actions are handled in shared collections where they are exposed to concurrency issues. For example, the Span object we create at the very entry is updated multiple times across this code on the request handling thread, but later being used for serialization and sending to the APM server by a different thread. Furthermore, we need to know whether we trace sync or potentially-async operations. If our trace can start in some thread and continue in other threads we must take that into consideration.

Back to our test case

Following is a description of what it took to implement the Elasticsearch REST client plugin, divided into three steps only for convenience.

A word of warning: It’s becoming very technical from here on...

Step 1: Selecting what to instrument

This is the most important step of the process. If we do a bit of research and do this properly, we are more likely to find just the right method/s and make it real easy. Things to consider:

  • Relevance: we should instrument method/s that
    • Capture exactly what we want to capture. For example, we need to make sure that end-time minus start-time of the method reflects the duration of the span we want to create
    • No false positives. If method invoked we are always interested to know
    • No false negatives. Method always called when the span-related action is executed
    • Have all the relevant information available when entered or exited
  • Forward compatibility: we would aim for a central API that is not likely to change often. We don’t want to update our code for every minor version of the traced library.
  • Backward compatibility: how far back is this instrumentation going to support?

Not knowing anything about the client code (even though it’s Elastic’s), I downloaded and started investigating the latest version, which was 6.4.1 at the time. Elasticsearch Java REST client offers both high and low level APIs, where the high level API depends on the low level API, and all queries eventually go through the latter. Therefore, to support both, naturally we would only look in the low level client.

Digging into the code, I found a method with the signature Response performRequest(Request request) (here in GitHub). There are four additional overrides to the same method, all call this one and all are marked as deprecated. Furthermore, this method calls performRequestAsyncNoCatch. The only other method calling the latter is a method with the signature void performRequestAsync(Request request, ResponseListener responseListener). A bit more research showed that the async path is exactly the same as the sync one: four additional deprecated overrides calling a single non-deprecated one that calls performRequestAsyncNoCatch for making the actual request. So, for relevance the performRequest method got a score of 100%, as it exactly captures all and only sync requests, with both the request and response info available at entry/exit: perfect! The way we tell Byte Buddy that we want to instrument this method is by overriding the relevant matcher-providing methods.

The way we decide which class and method to instrument

Looking forward, this new central API seemed a good bet for stability. Looking backwards, though, it wasn’t as good of a choice — version 6.4.0 and prior didn’t have this API…

Since this was such a perfect candidate for instrumentation, I decided to use it and get long lasting support for Elasticsearch REST client, and add additional instrumentation for older versions. I did a similar process looking for a candidate there, and ended up with two solutions — one for versions 5.0.2 through 6.4.0 and another for versions >=6.4.1.

Step 2: Code design

We use Maven, and each new instrumentation we introduce to support a new technology would be a module we refer to as a plugin. In my case, since I wanted to test both old and new Elasticsearch REST clients (meaning conflicting client dependencies), and because the instrumentation would be a bit different for each, it made sense that each would have its own module/plugin. Since both are for supporting the same technology, I nested them under a common parent module, ending up with the following structure:

It is important that only the actual plugin code will be packaged into the agent, so make sure library dependencies are scoped as provided and test dependencies are scoped test in your pom.xml. If you add third-party code, it must be shaded, that is, repackaged to use the root Elastic APM Java agent package name.

As for the actual code, following are the minimal requirements for adding a plugin:

The Instrumentation class

An implementation of the abstract ElasticApmInstrumentation class. Its role is to assist with the identification of the right class and method for instrumentation. Since the type and method matching can considerably extend application startup times, the instrumentation class provides some filters that enhance the matching process, for example, overlook classes that do not contain a certain string in their name or classes loaded by a class loader that has no visibility at all to the type we are looking for. In addition, it provides some meta-information that enables turning the instrumentation on and off through configuration.

Notice that ElasticApmInstrumentation is used as a service, which means each implementation needs to be listed in a provider configuration file.

The service provider configuration file

Your ElasticApmInstrumentation implementation is a service provider, which is identified in runtime through a provider-configuration file located in the resource directory META-INF/services. The name of the provider-configuration file is the fully qualified name of the service and it contains a list of fully qualified names of service providers, one per line.

The Advice class

This is the class that provides the actual code that will be injected into the traced method. It doesn’t implement a common interface, but normally uses Byte Buddy’s @Advice.OnMethodEnter and/or @Advice.OnMethodExit annotations. This is how we tell Byte Buddy what code we want to inject at the beginning of a method and just before it is exited (quietly or throwing a Throwable). The rich Byte Buddy API allows us doing all kinds of fancy stuff here, like

Eventually, my Elasticsearch REST client module structure is as follows:

Step 3: Implementing

As said above, writing agent code has some specifics to it. Let’s see how these concepts came about in this plugin:

Span creation and maintenance

Elastic APM uses spans to reflect each event of special interest, like handling HTTP request, making DB query, making remote call etc. The root span of each span-tree recorded by an agent is called a Transaction (see more in our data model documentation). In this case, we are using a Span to describe the Elasticsearch query, as it is not the root event recorded on the service. Like in this case, a plugin will normally create a span, activate it, add data to it and eventually deactivate and end it. Activation and deactivation are the actions of maintaining a thread-context state that enables obtaining the currently active span anywhere in the code (like we do when creating the span). A span must be ended and an activated span must be deactivated, so using try/finally is a best practice in this regard. In addition, if an error occurs, we should report that as well.

Never break user code (and avoid side-effects)

Besides writing a very “defensive” code, we always assume our code may throw Exceptions, which is why we use suppress = Throwable.class in our advices. This tells Byte Buddy to add an Exception handler for all Throwable types thrown during advice code execution, which ensures user code is still executed if our injected code fails.

In addition, we must make sure we don’t cause any side effects with our advice code that may change the state of the instrumented code and affect its behavior as a result. In my case, this was relevant for reading the request body of Elasticsearch queries. The body is read by obtaining the request content stream through a getContent API. Some implementations of this API will return a new InputStream instance for each invocation, while others return the same instance for multiple invocations per request. Since we only know which implementation is used at runtime, we must ensure that reading the body will not prevent it from being read by the client. Luckily, there is also an isRepeatable API that tells us exactly that. If we fail to ensure that, we may break client functionality.

Class visibility

By default, the Instrumentation class is also the Advice class. However, there is one important difference between them due to their role. The Instrumentation methods are always invoked, regardless of whether the corresponding library it aims to instrument is actually available or even used at all. On the other hand, the Advice code is only used when the relevant class of a specific library was detected. My Advice code has dependencies on Elasticsearch REST client code in order to get information like the URL used for the request, the request body, response code, etc. Therefore, it would be safer to compile the Advice code in a separate class and only refer to it by the Instrumentation class where needed. Note that more often than not, Advice code will have dependencies on the instrumented library, so this may be a good practice in general.

Performance overhead considerations

One of the things we wanted to do is get _search queries, which means reading the HTTP request body that we have access to in the form of InputStream. There is not much we can do about the fact that we need to store the body content somewhere, so the memory overhead would be at least the length of the body we allow reading for each traced request. However, there is a lot to do regarding memory allocations, that translate to CPU and pauses due to garbage collection. So we reuse ByteBuffer to copy bytes read from the stream, CharBuffer to store the query content until serialized and sent to APM server and even CharsetDecoder. This way, we don’t allocate and deallocate any memory on a per-request basis. This decreases overhead on the account of a bit more complicated code (code in the IOUtils class).

End result 

General tips (not demonstrated in the test case)

Beware of nested calls

In some cases, when instrumenting API methods, you may encounter a scenario where one instrumented method calls another instrumented method. For example, an override method calling its super-method, or one implementation of the API wrapping another. Awareness of such scenarios is important, as we wouldn’t want multiple spans reported for the same action. There are no rules that say when this applies and when not, and most likely you would get different behaviour in different scenarios/settings, so the tip in this case is just to code in this awareness.

Beware of self-monitoring

Make sure your tracing code does not cause invocation of actions that will be traced as well. In the better scenario, this will result in reporting traced operations that  the result of the tracing process itself. In the worse case, this can lead to stack overflow. An example would be in JDBC tracing — when trying to get some DB info, we use the java.sql.Connection#getMetaData API, which may cause a DB query to occur and get traced, leading to another invocation of java.sql.Connection#getMetaData and so forth.

Be aware of asynchronous operations

Asynchronous execution means that a span/transaction may be created in one thread, then get activated in another thread. Each span/transaction must be ended exactly once and always deactivated in each thread it was activated at. Therefore, one must be absolutely aware to this aspect.

Summary

One of the major benefits of working on an open source project is the tight relationships with community. We are very happy to get any feedback, suggestions and contribution to our code base. Don’t hesitate to offer us code, and feel free to reach out and talk to us through our APM forum or GitHub repo before you start, to discuss approaches to avoid duplicate work.