Elastic APM for iOS and Android Native apps

Elastic APM for native apps provides auto-instrumentation of outgoing HTTP requests and view loads, captures custom events, errors, and crashes, and includes pre-built dashboards for data analysis and troubleshooting.


Elastic® APM for iOS and Android native apps is generally available as of stack release 8.12. The Elastic iOS and Android APM agents are open source and are built as distributions of the OpenTelemetry Swift and Android SDK/API, respectively.

Overview of the Mobile APM solution

The OpenTelemetry SDK/API for iOS and Android supports capabilities such as auto-instrumentation of HTTP requests, an API for manual instrumentation, a data model based on the OpenTelemetry semantic conventions, and buffering. In addition, the Elastic APM agent distributions offer a simpler initialization process and novel features such as remote configuration and user-session-based sampling. Because the Elastic iOS and Android APM agents are distributions, they are maintained per Elastic’s standard support T&Cs.

Kibana® provides curated, pre-built dashboards for monitoring, data analysis, and troubleshooting. The Service Overview view shown below provides relevant frontend KPIs such as crash rate, HTTP requests, and average app load time, along with a comparison view.

1 - comparison view

Further, the geographic distribution of user traffic is available on a map at a country and regional level. The service overview dashboard also shows trends of metrics such as throughput, latency, failed transaction rate, and distribution of traffic by device make-model, network connection type, and app version.  

The Transactions view shown below highlights the performance of the different transaction groups, including the end-to-end distributed trace of individual transactions with links to associated spans, errors, and crashes. Users can also see at a glance the distribution of traffic by device make and model, app version, and OS version.

2 - opbeans android

Tabular views, such as the one highlighted below at the bottom of the Transactions tab, make it relatively easy to see how the device make and model, app version, and so on affect latency and crash rate.

3 - latency and crash rate

The Errors & Crashes view shown below can be used to analyze the different error and crash groups. The unsymbolicated (iOS) or obfuscated (Android) stacktrace of the individual error or crash instance is also available in this view. 

4 - opbeans swift

The Service Map view shown below provides a visualization of the end-to-end service interdependencies, including any third-party APIs, proxy servers, and databases.  

5 - flowchart

The comprehensive pre-built dashboards for observing the mobile frontend in Kibana provide visibility into the sources of errors, crashes, and bottlenecks to ease troubleshooting of issues in the production environment. The underlying Elasticsearch® Platform also supports the ability to query raw data, build custom metrics and custom dashboards, alerting, SLOs, and anomaly detection. Altogether the platform provides a comprehensive set of tools to expedite root cause analysis and remediation, thereby facilitating a high velocity of innovation. 

Walkthrough of the debugging workflow for some error scenarios

Next, we will provide a walkthrough of the configuration details and the troubleshooting workflow for a couple of error scenarios in iOS and Android native apps.

Scenario 1

In this example, we will debug a crash in an asynchronous method using Apple’s crash report symbolication as well as breadcrumbs to deduce the cause of the crash. 

Symbolication
In this scenario, users notice a spike in the crash occurrences of a particular crash group in the Errors & Crashes tab and decide to investigate further. A new crash comes in on the Crashes tab, and the developer follows these steps to symbolicate the crash report locally.

1. Copy the crash via the UI and paste it into a file with the following name format: `<AppBinaryName>_<DateTime>`. For example, `opbeans-swift_2024-01-18-114211.ips`.

6 - Symbolication

2. Apple provides detailed instructions on how to symbolicate this file locally either automatically through Xcode or manually using the command line.
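
The file name from step 1 can also be generated rather than typed by hand. The following is a minimal plain Java sketch, not part of any Elastic or Apple tooling. The class and method names are hypothetical, and the timestamp layout (`yyyy-MM-dd-HHmmss`) and the `.ips` extension are assumptions inferred from the single example above.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: builds a crash report file name in the
// <AppBinaryName>_<DateTime> format described in step 1.
final class CrashReportFileName {
    // Assumed timestamp layout, inferred from "2024-01-18-114211" in the example.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd-HHmmss");

    static String forCrash(String appBinaryName, LocalDateTime crashTime) {
        // The ".ips" extension is taken from the example file name.
        return appBinaryName + "_" + STAMP.format(crashTime) + ".ips";
    }

    public static void main(String[] args) {
        LocalDateTime crashTime = LocalDateTime.of(2024, 1, 18, 11, 42, 11);
        System.out.println(forCrash("opbeans-swift", crashTime));
    }
}
```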

Breadcrumbs
The second frame of the first thread shows that the crash is occurring in a Worker instance.

7 - Breadcrumbs

This instance is used in many places, and because the function is asynchronous, it is not possible to determine immediately where the call is coming from. Nevertheless, we can use features of the OpenTelemetry SDK to add more context to these crashes and then put the pieces together to find the site of the crash.

By adding “breadcrumbs” around this Worker instance, it is possible to track down which calls to the Worker are actually associated with this crash.

Example:
Create a logger provider in the Worker class as a public variable for ease of access, as shown below:

8 - example code

Create breadcrumbs everywhere the Worker.doWork() function is called: 

9 - Create breadcrumbs everywhere the Worker.doWork() function

Each of these breadcrumbs will use the same event name “worker_breadcrumb” so they can be consistently queried, and the differentiation will be done using the “source” attribute. 

In this example, the Worker.doWork() function is being called from a CustomerRow struct (a table row which does work ‘onTapGesture’). If you were to call this method from multiple places in a CustomerRow struct, you may also add additional differentiations to the “source” attribute value, such as the associated function (e.g., “CustomerRow#onTapGesture”). 
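
To make the pattern concrete without depending on any agent API, here is a plain Java sketch of the same idea: one shared event name and a per-call-site `source` attribute. It is an illustration under stated assumptions, not the agent's implementation. The `emitted` list is a hypothetical in-memory stand-in for the real event emitter, and the class, method, and attribute key names other than `worker_breadcrumb` and `source` are invented for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class WorkerBreadcrumbs {
    // One shared event name so every breadcrumb matches the same query.
    static final String EVENT_NAME = "worker_breadcrumb";

    // Hypothetical in-memory stand-in for the agent's event/log emitter.
    static final List<Map<String, String>> emitted = new ArrayList<>();

    // Records a breadcrumb; only the "source" attribute differs between call sites.
    static void leaveBreadcrumb(String source) {
        Map<String, String> event = new LinkedHashMap<>();
        event.put("name", EVENT_NAME);
        event.put("source", source);
        emitted.add(event);
    }

    // Placeholder for the real asynchronous work.
    static void doWork() { }

    // Call sites: leave the breadcrumb first, then call into the Worker.
    static void customerRowOnTapGesture() {
        leaveBreadcrumb("CustomerRow#onTapGesture");
        doWork();
    }

    static void productRowOnTapGesture() {
        leaveBreadcrumb("ProductRow#onTapGesture");
        doWork();
    }

    public static void main(String[] args) {
        customerRowOnTapGesture();
        productRowOnTapGesture();
        System.out.println(emitted);
    }
}
```

Leaving the breadcrumb before the call matters: if `doWork()` crashes, the breadcrumb has already been recorded.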

Now that the app is reporting these breadcrumbs, we can use Discover to query for them, as shown below:

10 - Discover to query

Note: Event names sent by the agent are translated to `event.action` in Elastic Common Schema (ECS), so ensure the query uses this field.

  1. You can add a filter: `event.action: "worker_breadcrumb"` and it shows all events generated from this new breadcrumb.

  2. You can also see the various sources: ProductRow, CustomerRow, CartRow, etc.

  3. If you add `error.type: crash` to the query, you can see crashes alongside the breadcrumbs:

11 - crashes alongside the breadcrumbs

A crash and a breadcrumb next to each other in the timeline may come from completely different devices, so we need another differentiator. For each crash, we have metadata that contains the session.id associated with the crash, viewable from the Metadata tab. We can query using this session.id to ensure that the only data we are looking at in Discover is from a single user session (i.e., a single device) that resulted in the crash.

12 - session.id

In Discover, we can now see the session event flow, on a single device, concerning the crash via the breadcrumbs, as shown below:

13 - session event flow

It looks like the last breadcrumb before the crash came from “CustomerRow.” This gives the app developer a good place to start their root cause analysis.
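
The correlation we just did by eye in Discover can be summarized as: keep only the events from the crashing session, order them by time, and take the last breadcrumb seen before the crash. Below is a minimal plain Java sketch of that logic over in-memory data. It does not query Elasticsearch, and the `Event` class, its fields, and the sample session IDs are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SessionTimeline {
    // Hypothetical, simplified view of a document returned by the query.
    static final class Event {
        final long timestampMillis;
        final String sessionId;
        final String action;  // mirrors event.action; null for non-breadcrumb events
        final String source;  // breadcrumb "source" attribute; null otherwise
        final boolean crash;  // true when the document is a crash

        Event(long timestampMillis, String sessionId, String action, String source, boolean crash) {
            this.timestampMillis = timestampMillis;
            this.sessionId = sessionId;
            this.action = action;
            this.source = source;
            this.crash = crash;
        }
    }

    // Returns the source of the last breadcrumb before the first crash in the
    // given session, or null if the session has no crash or no earlier breadcrumb.
    static String lastBreadcrumbBeforeCrash(List<Event> events, String sessionId) {
        List<Event> session = new ArrayList<>();
        for (Event e : events) {
            if (sessionId.equals(e.sessionId)) {
                session.add(e); // drop events from other devices/sessions
            }
        }
        session.sort(Comparator.comparingLong(e -> e.timestampMillis));

        String lastSource = null;
        for (Event e : session) {
            if (e.crash) {
                return lastSource;
            }
            if ("worker_breadcrumb".equals(e.action)) {
                lastSource = e.source;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(new Event(100, "session-a", "worker_breadcrumb", "ProductRow", false));
        events.add(new Event(150, "session-b", "worker_breadcrumb", "CartRow", false));
        events.add(new Event(200, "session-a", "worker_breadcrumb", "CustomerRow", false));
        events.add(new Event(210, "session-a", null, null, true));
        System.out.println(lastBreadcrumbBeforeCrash(events, "session-a"));
    }
}
```

Filtering by session first is what prevents the "CartRow" breadcrumb from another device, which sits between the two "session-a" breadcrumbs in the timeline, from being mistaken for part of the crashing flow.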

Scenario 2

Note: This scenario requires the Elastic Android agent version “0.14.0” or higher.

An Android sample app has a form composed of two screens that are created using two fragments (`FirstPage` and `SecondPage`). In the first screen, the app makes a backend API call to get a key that identifies the form submission. This key is stored in memory in the app and must be available on the last screen where the form is sent; the key must be sent along with the form's data.

14 - form submission

The problem
We start to see a spike in crash occurrences in Kibana (null pointer exception) in the Errors & Crashes tab that always seem to happen on the last screen of the form, when the users click on the "FINISH" button. Nevertheless, this is not always reproducible, so the root cause isn't clear just by looking at the crash’s stacktrace alone. Here’s what it looks like:

15 - stack trace

When we take a look at the code referenced in the stacktrace, this is what we can see:

16 - code referenced in the stacktrace

This is the line where the crash happens, so it seems like the variable “formId” (which is a static String located in “FirstPage”) was null by the time this code was executed, causing a null pointer exception to be raised. This variable is set within the “FirstPage” fragment after the backend request is done to retrieve the id. The only way to get to the “SecondPage” is by passing through the “FirstPage.” So, the stacktrace alone doesn’t help much as the pages have to be opened in order, and the first one will always set the “formId” variable. Therefore, it doesn’t seem likely that the formId could be null in “SecondPage.”

Finding the root cause
Apart from the crash’s stacktrace, it can also be useful to look at complementary data that helps put the pieces together and gives a broader picture of what else was happening in our app around the time of the crash. In this case, we know that the form ID must come from our backend service, so we can start by ruling out an error in the backend call. We do this by checking the traces from the creation of our FirstPage fragment, where the form ID request is executed, in the Transaction details view:

17 - trace sample

The “Created” spans represent the time it took to create the first fragment. The topmost one shows the Activity creation, followed by the NavHostFragment, followed by “FirstScreen.” Not long after its creation, we see that a GET HTTP request to our backend is made to retrieve our form ID and, according to the traces, the GET request was successful. We can therefore rule out that there is an issue with the backend communication for this problem.

Another option could be looking at the logs sent throughout the session in our app where the crash occurred (we could also take a look at all the logs coming from our app but they would be too many to analyze this one issue). To do so, we first copy one of the spans’ “session.id” values (any span would work since the same session ID will be available in all the data that was sent from our app during the time that the crash occurred) available in the span details flyout.

18 - red box highlighted

Note: The same session ID can also be found in the crash metadata.

Now that we have identified our session, we can open up the Logs Explorer view and take a look at all of our app’s logs within that same session, as shown below:

19 - app's logs

By looking at the logs, and adding a few fields to show the app’s lifecycle status and the error types, we see the log events that are automatically collected from our app. We can see the crash event at the top of the list as the latest one. We can also see our app’s lifecycle events, and if we keep scrolling through, we’ll get to some lifecycle events that are going to help find our root cause:

20 - root cause

We can see there are a couple of lifecycle events that tell us that the app was restarted during the session. This is an important hint because it means that the Android OS killed our app at some point, which is common when an app stays in the background for a while. With this information, we could try to reproduce the issue by forcing the OS to kill our app in the background and then see how it behaves when reopened from the recently opened apps menu.

After giving it a try, we were able to reproduce the issue, and we found that the static “formId” variable was lost when the app was restarted, causing it to be null when the SecondPage fragment requested it. We can now research best practices for passing arguments to Fragments and change our code so that it no longer relies on static fields to store and share values between screens, preventing this crash from happening again.
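
The failure mode can be illustrated outside Android with a plain Java sketch. This is a simulation under stated assumptions, not the sample app's code: `simulateProcessDeath()` models the OS killing the backgrounded process by resetting the static field, and the `Map` is a hypothetical stand-in for state that is handed back after a restart (for example, saved instance state or fragment arguments). All class and method names are invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

final class ProcessDeathDemo {
    // Mirrors the static field described above: it lives only as long as the process.
    static String formId;

    // Called when the backend returns the form ID on the first screen.
    static void onFormIdReceived(String id, Map<String, String> outState) {
        formId = id;               // fragile: lost if the process is killed
        outState.put("formId", id); // survives in this simulation, like saved state
    }

    // Models a process restart: statics start out at their default values again.
    static void simulateProcessDeath() {
        formId = null;
    }

    // Last screen, fragile version: dereferences the static field.
    static String submitUsingStatic() {
        return "form/" + formId.trim(); // NullPointerException after a restart
    }

    // Last screen, safer version: reads the value from the restored state.
    static String submitUsingSavedState(Map<String, String> restoredState) {
        return "form/" + restoredState.get("formId");
    }

    public static void main(String[] args) {
        Map<String, String> state = new HashMap<>();
        onFormIdReceived("abc123", state);
        simulateProcessDeath();

        System.out.println(submitUsingSavedState(state));
        try {
            submitUsingStatic();
        } catch (NullPointerException e) {
            System.out.println("static formId was lost after restart");
        }
    }
}
```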

Bonus: For this scenario, it was enough for us to rely on the events that are sent automatically by the APM agent; however, if those aren’t enough for other cases, we can always send custom events in the places where we want to track the state changes of our app via the OpenTelemetry event API, as shown in the code snippet below:

21 - black code box

Make the most of your Elastic APM Experience

In this post, we reviewed Elastic’s new Mobile APM solution available in 8.12. The new solution uses Elastic’s new iOS and Android APM agents, which are open source and are built as distributions of the OpenTelemetry Swift and Android SDK/API, respectively.

We also reviewed configuration details and the troubleshooting workflow for two error scenarios in iOS and Android native apps.

  • iOS scenario: Debug a crash in an asynchronous method using Apple’s crash report symbolication as well as breadcrumbs to deduce the cause of the crash.

  • Android scenario: Analyze why users get a null pointer exception on the last screen when they click on the “FINISH” button of a form. The cause is not clear from the crash’s stack trace alone, and the crash is not easily reproducible.

In both instances, we found the root cause of the crash using distributed traces from the mobile device as well as correlated logs. Hopefully this blog provided a review of how Elastic can help manage and monitor Mobile native apps.

Elastic invites SREs and developers to experience our Mobile APM solution firsthand and unlock new horizons in their data tasks. Try it today at https://ela.st/free-trial.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.