Root cause analysis with logs: Elastic Observability's anomaly detection and log categorization

With more and more applications moving to the cloud, an increasing amount of telemetry data (logs, metrics, traces) is being collected, which can help improve application performance, operational efficiencies, and business KPIs. However, analyzing this data is tedious and time consuming given the sheer volume being generated. Traditional alerting and simple pattern matching (visual inspection or basic searching) are not sufficient for IT Operations teams and SREs. It’s like trying to find a needle in a haystack.

In this blog post, we’ll cover some of Elastic’s artificial intelligence for IT operations (AIOps) and machine learning (ML) capabilities for root cause analysis.

Elastic’s machine learning will help you investigate performance issues by providing anomaly detection and pinpointing potential root causes through time series analysis and log outlier detection. These capabilities will help you reduce the time it takes to find that “needle” in the haystack.

Elastic’s platform enables you to get started on machine learning quickly. You don’t need to have a data science team or design a system architecture. Additionally, there’s no need to move data to a third-party framework for model training. 

Preconfigured machine learning models for observability and security are available. If those don't work well enough on your data, in-tool wizards guide you through the few steps needed to configure custom anomaly detection jobs or train your own models with supervised learning. To help you get started, several key features are built into Elastic Observability to aid analysis without requiring you to run specific ML models. These features help minimize the time spent analyzing logs.

Let’s review some of these built-in ML features:

Anomaly detection: Elastic Observability, when turned on (see documentation), automatically detects anomalies by continuously modeling the normal behavior of your time series data — learning trends, periodicity, and more — in real time to identify anomalies, streamline root cause analysis, and reduce false positives. Anomaly detection runs in and scales with Elasticsearch and includes an intuitive UI.

Log categorization: Using anomaly detection, Elastic also identifies patterns in your log events quickly. Instead of manually identifying similar logs, the logs categorization view lists log events that have been grouped, based on their messages and formats, so that you can take action quicker. 

High-latency or erroneous transactions: Elastic Observability’s APM capability helps you discover which attributes are contributing to increased transaction latency and identifies which attributes are most influential in distinguishing between transaction failures and successes. An overview of this capability is published here: APM correlations in Elastic Observability: Automatically identifying probable causes of slow or failed transactions.

AIOps Labs: AIOps Labs provides two main capabilities using advanced statistical methods: 

  • Log spike detector helps identify reasons for increases in log rates. It makes it easy to find and investigate causes of unusual spikes by using the analysis workflow view. Examine the histogram chart of the log rates for a given data view, and find the reason behind a particular change possibly in millions of log events across multiple fields and values.
  • Log pattern analysis helps you find patterns in unstructured log messages and makes it easier to examine your data. It performs categorization analysis on a selected field of a data view, creates categories based on the data, and displays them together with a chart that shows the distribution of each category and an example document that matches the category.
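To make the idea of grouping log events by message format more concrete, here is a minimal Python sketch. It is not Elastic's categorization algorithm: the masking rules and the sample messages are assumptions chosen for illustration. It simply replaces variable-looking tokens with placeholders and counts how many messages share each resulting template, keeping one example per category.

```python
import re
from collections import Counter

# Patterns for tokens that usually vary between otherwise identical messages.
# These rules are illustrative assumptions, not Elastic's tokenizer.
VARIABLE_TOKEN_PATTERNS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<NUM>"),
]


def to_template(message: str) -> str:
    """Replace variable-looking tokens with placeholders."""
    template = message
    for pattern, placeholder in VARIABLE_TOKEN_PATTERNS:
        template = pattern.sub(placeholder, template)
    return template


def categorize(messages):
    """Group messages by template.

    Returns a list of (template, count, example) tuples, most frequent first.
    """
    counts = Counter()
    examples = {}
    for message in messages:
        template = to_template(message)
        counts[template] += 1
        examples.setdefault(template, message)  # keep the first example seen
    return [(t, c, examples[t]) for t, c in counts.most_common()]


# Made-up sample messages, for illustration only.
sample_logs = [
    "connection received: host=10.0.0.12 port=52144",
    "connection received: host=10.0.0.37 port=40122",
    "checkpoint complete: wrote 118 buffers",
    "connection received: host=10.0.0.12 port=52150",
    "checkpoint complete: wrote 42 buffers",
    "terminating connection due to administrator command",
]

categories = categorize(sample_logs)
for template, count, example in categories:
    print(f"{count:>3}  {template}  (e.g. {example!r})")
```

Even this naive version shows why categories are useful: six raw lines collapse into three patterns, and the low-count pattern stands out immediately.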

In this blog, we will use anomaly detection and log categorization to help identify the root cause of an issue in the popular “Hipster Shop” demo app, which was developed by Google and recently modified by OpenTelemetry.

Overviews of the high-latency capabilities can be found here, and an overview of AIOps Labs can be found here.

Prerequisites and config

If you plan on following this blog, here are some of the components and details we used to set up this demonstration:

Once you’ve instrumented your application with APM (Elastic or OTel) agents and are ingesting metrics and logs into Elastic Observability, you should see a service map for the application as follows:

In our example, we’ve introduced issues to help walk you through the root cause analysis features: anomaly detection and log categorization. You might have a different set of anomalies and log categorization depending on how you load the application and/or introduce specific issues.

As part of the walk-through, we’ll assume we are a DevOps or SRE managing this application in production.

Root cause analysis

While the application has been running normally for some time, you get a notification that some of the services are unhealthy. This notification can come from alerting you’ve set up in Elastic or from external notification platforms (including customer-reported issues). In this instance, we’re assuming that customer support has called in multiple customer complaints about the website.

How do you as a DevOps or SRE investigate this? We will walk through two avenues in Elastic to investigate the issue:

  • Anomaly detection
  • Log categorization

While we show these two paths separately, they can be used in conjunction and are complementary, as they are both tools Elastic Observability provides to help you troubleshoot and identify a root cause.

Machine learning for anomaly detection

Elastic detects anomalies based on historical patterns and estimates how improbable each one is.

Starting with the service map, you can see anomalies identified with red circles and as we select them, Elastic will provide a score for the anomaly.

In this example, we can see that there is a score of 96 for a specific anomaly for the productCatalogService in the Hipster Shop application. An anomaly score indicates the significance of the anomaly compared to previously seen anomalies. More information on anomaly detection results can be found here. We can also dive deeper into the anomaly and analyze the details.
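As a rough intuition for how a latency spike can turn into a score, here is a small Python sketch. It is deliberately simplistic and is not how Elastic computes anomaly scores: it assumes a trailing-window baseline, a z-score on the high side only, and a linear mapping onto 0 to 100. The function name, the window size, the cap, and the latency values are all hypothetical.

```python
from statistics import mean, pstdev


def anomaly_scores(values, window=10, z_cap=6.0):
    """Score each point from 0 to 100 by how far it sits above a trailing baseline.

    Illustrative only: a z-score against the previous `window` points,
    clipped at `z_cap` and scaled linearly to 0-100.
    """
    scores = []
    for i, value in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < window:
            scores.append(0.0)  # not enough history to judge this point
            continue
        baseline = mean(history)
        spread = pstdev(history) or 1e-9  # avoid division by zero on flat data
        z = max(0.0, (value - baseline) / spread)  # only high-side deviations
        scores.append(round(min(z, z_cap) / z_cap * 100, 1))
    return scores


# Hypothetical average transaction latency per time bucket, in milliseconds.
latency_ms = [270, 280, 275, 265, 285, 272, 278, 269, 281, 276, 274, 1040, 990, 279]

scores = anomaly_scores(latency_ms)
for bucket, (latency, score) in enumerate(zip(latency_ms, scores)):
    print(f"bucket {bucket:>2}: {latency:>5} ms -> score {score}")
```

In this toy data, the first spike bucket reaches the cap, while the bucket right after it scores lower because the spike has already inflated the trailing baseline. That weakness is one reason production systems model trends and periodicity rather than relying on a plain rolling average.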

What you will see for the productCatalogService is a severe spike in average transaction latency, which is the anomaly that was detected in the service map. Elastic’s machine learning has identified a specific metric anomaly (shown in the single metric view). Customers are likely reacting to the slowness of the site, and the company may be losing transactions.

A good next step is to review all the other anomalies we saw in the service map as part of a larger picture. Use the Anomaly Explorer to view all the anomalies that have been identified.

Elastic has identified numerous services with anomalies. productCatalogService has the highest score, and a good number of others, including frontend, checkoutService, and advertService, also have high scores. However, this analysis is looking at just one metric.

Elastic can help detect anomalies across all types of data, such as Kubernetes data, metrics, and traces. If we analyze across all these types (via individual jobs we’ve created in Elastic machine learning), we get a more comprehensive view of what is potentially causing this latency issue.

Once all the potential jobs are selected and we’ve sorted by service.name, we can see that productCatalogService is still showing a high anomaly influencer score.

In addition to the chart giving us a visual of the anomalies, we can review all the potential anomalies. As you will notice, Elastic has also categorized these anomalies (see the category examples column). As we scroll through the results, we notice a potential PostgreSQL issue from the categorization, which also has a high score of 94. Machine learning has identified a “rare mlcategory,” meaning that this category of log message has rarely occurred, hence pointing to a potential cause of the issue customers are seeing.
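The post does not show the configuration of the job behind this result. As a hedged sketch, a job that flags rarely seen log categories could be defined roughly as below, using the `rare` function on `mlcategory` in the body of a request to Elasticsearch's create anomaly detection job API. The job ID, bucket span, `message` field, and `service.name` influencer are assumptions for illustration, not the values used in this demonstration. The Python snippet only builds and prints the request body; it does not contact a cluster.

```python
import json

# Hypothetical job ID; the post does not name the job it used.
JOB_ID = "postgresql-rare-log-categories"

job_config = {
    "description": "Rare log message categories (illustrative sketch)",
    "analysis_config": {
        # Assumed bucket span; tune to your data volume.
        "bucket_span": "15m",
        # Categorize the raw log text; "message" is an assumed field name.
        "categorization_field_name": "message",
        "detectors": [
            {
                # Flag categories that have rarely been seen before.
                "function": "rare",
                "by_field_name": "mlcategory",
                "detector_description": "Rare log categories",
            }
        ],
        # Assumed influencer so results can be attributed to a service.
        "influencers": ["service.name"],
    },
    "data_description": {"time_field": "@timestamp"},
}

# Print the request that could be sent from Dev Tools or any HTTP client.
print(f"PUT _ml/anomaly_detectors/{JOB_ID}")
print(json.dumps(job_config, indent=2))
```

A job like this also needs a datafeed pointing at the log indices before it produces results; the in-tool wizards mentioned earlier take care of that step.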

We also notice that this issue is potentially caused by pgbench, a popular PostgreSQL tool for benchmarking the database. pgbench runs the same sequence of SQL commands over and over, possibly in multiple concurrent database sessions. While pgbench is definitely a useful tool, it should not be used in a production environment because it puts heavy load on the database host, which is likely what is causing the higher latency on the site.

While this may or may not be the ultimate root cause, we have rather quickly identified a potential issue that has a high probability of being the root cause. An engineer likely intended to run pgbench against a staging database to evaluate its performance, not against the production environment.

Machine learning for log categorization

Elastic Observability’s service map has detected an anomaly, and in this part of the walk-through, we take a different approach by investigating the service details from the service map rather than starting from the anomaly. When we explore the service details for productCatalogService, we see the following:

The service details are identifying several things:

  1. There is abnormally high latency compared to the expected bounds of the service. Recently, latency was higher than normal (upward of 1s) compared to the average of 275ms.
  2. There is also a high failure rate for the same time frame as the high latency (lower left chart “Failed transaction rate”).
  3. Additionally, we can see the transactions and one in particular /ListProduct has an abnormally high latency, in addition to a high failure rate.
  4. We see productCatalogService has a dependency on PostgreSQL.
  5. We also see errors all related to PostgreSQL.

We have an option to dig through the logs and analyze in Elastic or we can use a capability to identify the logs more easily. 

If we go to Categories under Logs in Elastic Observability and search for postgresql.log to help identify the PostgreSQL logs that could be causing this error, we see that Elastic’s machine learning has automatically categorized them.

We notice two additional items:

  • There is a high-count category (message count of 23,797 with a high anomaly score of 70) related to pgbench (which is odd to see in production). Hence, we search further for all pgbench-related logs in Categories.
  • We see an odd issue regarding terminating the connection (with a low count).

While investigating the second error, which is severe, we can see logs from Categories before and after the error.

This troubleshooting shows PostgreSQL having a FATAL error, the database shutting down prior to the error, and all connections terminating. Given the two immediate issues we identified, we suspect that someone was running pgbench and that this overloaded the database, causing the latency issue that customers are seeing.

The next steps here could be to investigate with anomaly detection and/or work with the developers to review the code and determine whether pgbench is part of the deployed configuration.

Conclusion

I hope you’ve gotten an appreciation for how Elastic Observability can help you get closer to pinpointing the root cause of issues without having to look for a “needle in a haystack.” Here’s a quick recap of what you learned:

  • Elastic Observability has numerous capabilities to help you reduce your time to find root cause and improve your MTTR (even MTTD). In particular, we reviewed the following two main capabilities in this blog:
    1. Anomaly detection: Elastic Observability, when turned on (see documentation), automatically detects anomalies by continuously modeling the normal behavior of your time series data — learning trends, periodicity, and more — in real time to identify anomalies, streamline root cause analysis, and reduce false positives. Anomaly detection runs in and scales with Elasticsearch and includes an intuitive UI.
    2. Log categorization: Using anomaly detection, Elastic also identifies patterns in your log events quickly. Instead of manually identifying similar logs, the logs categorization view lists log events that have been grouped based on their messages and formats so that you can take action quicker. 
  • You learned how easy it is to use Elastic Observability’s log categorization and anomaly detection capabilities without having to understand machine learning (which helps drive these features) and without any lengthy setup.

Ready to get started? Register for Elastic Cloud and try out the features and capabilities I’ve outlined above.