11 July 2017 Engineering

Using Elasticsearch and Machine Learning for IT Operations

By Thomas Grabowski

Effective management of IT Operations requires feedback about the activity and performance of servers, applications, and network infrastructure, as well as any problems that may be occurring. The primary way to get this operational data is to collect the metrics and log data that these systems produce. Until now, most operations teams have depended on staff expertise to search and report on operational data. Operations staff can now use machine learning to identify anomalies in that data, which makes analysis more efficient and the team more effective.

In this post we explain how IT Operations staff can take advantage of machine learning with their operational data. First, we describe how machine learning can enhance current search, reporting, and alerting scenarios. Second, we illustrate how Elastic’s new machine learning feature can be integrated into regular IT Operations tasks. Third, we detail what type of operational data is best suited for machine learning in the Elastic Stack. Finally, we provide additional information for anyone to get started with Elastic’s machine learning.

Machine learning enhances search, reporting, and alerting

The Elastic Stack is a great foundation for monitoring IT Operations logs and metrics. It includes an important set of tools that operations teams need for their application and machine data. Operations teams are always looking for tools to reduce mean time to repair (MTTR) and proactively discover potential problem areas. The Elastic Stack provides an easy-to-use interface for real-time search, reporting, and analysis of streaming metrics and logs. Now, with the addition of integrated machine learning, the Elastic Stack becomes even more useful for IT Operations.

After metrics and logs are collected, organizations often begin by leveraging the power of Elasticsearch to query their data. There is a lot of value from being able to search for specific information in the operations data. For example, it can be very useful to search for a specific IP address in the operational data and follow its utilization of the application, but for large applications, it is not practical to review every single IP’s activity. Once the volume of data grows too large, most operations teams begin to see the difficulty in merely searching their data. Finding answers is contingent upon asking the right questions. Machine learning’s anomaly detection can help point out the right question to ask and reduce the difficulty in searching for data that operations staff doesn’t know exists.

Aggregated operations data can be organized into report visualizations and dashboards, like those found in Kibana. Kibana dashboards are great for getting overviews of data, but they don’t always show the data that matters most at a given moment. For example, when creating a report for reviewing web server logs, it is not difficult to create a visualization of HTTP status codes over time, but it would be very difficult to create enough visualizations to review status codes per unique IP address. Machine learning can evaluate the data to show entities that are not following their normal patterns.

Until now, the most common way to proactively watch data has been threshold or rules-based alerts. The downside of static thresholds is that they are often too rigid for dynamically changing data. Tuning thresholds to avoid false positives is tedious, and alerts that fire too often lose credibility with the people who receive them. Machine learning makes alerts more dynamic by learning a model of normal behavior and alerting when data doesn’t fit that historical model.
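
To make the contrast concrete, here is a minimal Python sketch that compares a fixed threshold with a baseline learned from recent history. It is a deliberately simplified stand-in (a trailing-window z-score), not the modeling that Elastic’s machine learning performs, and the window size, limit, and sample values are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def static_alerts(values, threshold):
    """Return indices where the value exceeds a fixed threshold."""
    return [i for i, v in enumerate(values) if v > threshold]

def baseline_alerts(values, window=24, z_limit=3.0):
    """Return indices where the value deviates from the trailing window's
    mean by more than z_limit standard deviations.
    Simplified illustration only, not Elastic's algorithm."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(values[i] - mu) / sigma > z_limit:
            alerts.append(i)
    return alerts

# Hypothetical metric: steady values between 100 and 104, with one jump to 160.
values = [100 + (i % 5) for i in range(30)]
values[27] = 160

static_hits = static_alerts(values, threshold=200)  # threshold set too high to notice
baseline_hits = baseline_alerts(values)             # flags the jump relative to history

print("static threshold alerts:", static_hits)
print("learned baseline alerts:", baseline_hits)
```

In this sample the fixed threshold of 200 never fires, while the learned baseline flags the jump because it is unusual relative to recent behavior.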

Operations organizations need smarter tools so they can make better decisions faster, reduce MTTR, and avoid wasting time searching for a cause in data that is unrelated to an outage or degraded application. Operational data can quickly become overwhelming in volume and complexity. It can require a lot of resources and time to build all the necessary searches, visualizations, and alerts to identify key performance issues or failures in the applications or architecture. With the introduction of machine learning, IT Operations teams can be more effective.

Using Elastic’s machine learning in IT Operations tasks

Elastic’s machine learning allows IT Operations teams to be more effective by using unsupervised learning algorithms to sort through the operations data. These algorithms can identify unusual activity based on historical analysis. Comparing new data to a historical model can produce more useful and effective alerting and graphical visualizations of operational data.

One of the goals for IT Operations is to detect and respond to application and infrastructure issues quickly. When there is a critical issue with an application and it is sending error messages in its log data, it is important that the administrator identify those unique messages and not lose them in the noise of all the other data being collected. Elastic’s machine learning can monitor message codes in the log messages and identify when there is a new error code or an unusual spike in a specific message code. Operations staff can be notified of these anomalies and then utilize search and reports from logs and metrics to get to a quicker root cause analysis.
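
As a rough illustration of the message-code scenario, the sketch below builds a job definition in Python and prints it as JSON. It assumes the job configuration layout described in the X-Pack 5.x machine learning API documentation (an `analysis_config` with `bucket_span` and `detectors`, plus a `data_description`), which in that release is submitted with `PUT _xpack/ml/anomaly_detectors/<job_id>`. The field name `message_code`, the 15 minute bucket span, and the descriptions are hypothetical, so check the documentation for your version before using anything like this.

```python
import json

# Hypothetical field name: use whichever field holds the message code in your index.
CODE_FIELD = "message_code"

job_config = {
    "description": "New or spiking message codes in application logs",
    "analysis_config": {
        # Assumed bucket span; events are summarized per 15 minute bucket.
        "bucket_span": "15m",
        "detectors": [
            {
                # "rare" looks for codes that are seldom or never seen.
                "detector_description": "rare message codes",
                "function": "rare",
                "by_field_name": CODE_FIELD,
            },
            {
                # "high_count" looks for unusually high event counts per code.
                "detector_description": "spike in a message code",
                "function": "high_count",
                "by_field_name": CODE_FIELD,
            },
        ],
    },
    "data_description": {"time_field": "@timestamp"},
}

# Print the JSON body that would be sent to the create-job API.
job_json = json.dumps(job_config, indent=2)
print(job_json)
```

The script only prints the request body. Sending it requires a cluster with X-Pack machine learning enabled and an HTTP client of your choice.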

Also, there are times when an issue isn’t sudden but rather results from a planned application or system configuration change. With a planned change, it is sometimes not obvious how it will affect the overall application and the user experience. By monitoring a few of the key metrics with machine learning, administrators can get a better view of application utilization both before and after the configuration change. Elastic’s machine learning can model normal behavior and identify how the ‘normal’ model changes because of the application or system change. This information can be used to validate that the change really did what it was intended to do.

Active Operations teams spend a lot of time writing alert rules to notify them of significant events in their data. After a while, these rules become difficult and costly to maintain as the application or business evolves. Eventually, rule-based alerting becomes ineffective and can cause alert overload, where no one pays attention because there are too many alerts to review. Elastic’s machine learning can automate the detection of potential issues and score each alert by how abnormal it is relative to historical behavior. By reducing the human burden of visually inspecting data and managing rules, machine learning allows IT Operations staff to be more effective in finding and reporting on issues. It also accelerates root cause analysis and speeds resolution by indicating how abnormal an event was and what other significant events occurred at the same time.

Elastic’s machine learning can use its algorithms to report on the chain of events that lead up to an issue. It doesn’t stop at telling you something is an anomaly. Using statistical testing, it identifies influential factors that contributed to the anomaly. For example, is the spike in application failures most likely coming from a failed drive or a misconfigured setting? Understanding the contributing factors helps resolve the issue more quickly. It assists administrators in getting to the root cause by automatically identifying what other data most likely influenced the issue.

Operations data is a good fit for Elastic’s machine learning

Elastic’s machine learning currently is focused on providing added value to time series data such as log files, application metrics, system performance metrics, network flows, and other transactional data that can be collected and stored in Elasticsearch. The key to getting good information is building a model of key indicators that occur with regularity throughout a data set.

When working with log data, some examples of regularly occurring key indicators would be HTTP status codes in Apache or Nginx web server messages, or message codes in Cisco IOS messages. These key indicators occur regularly enough to be trended over a time period for the number of times they occur.
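
Trending a key indicator over a time period amounts to counting its occurrences per time bucket. The Python sketch below shows that idea with a handful of made-up web server events. The 15 minute bucket size and the sample events are assumptions for illustration, and in practice this kind of aggregation is done inside Elasticsearch rather than in client code.

```python
from collections import Counter
from datetime import datetime, timezone

BUCKET_SECONDS = 15 * 60  # assumed bucket size: 15 minutes

def bucket_start(ts, span=BUCKET_SECONDS):
    """Round a timestamp down to the start of its bucket."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % span, tz=timezone.utc)

def count_by_code(events, span=BUCKET_SECONDS):
    """Count events per (bucket start, status code) pair."""
    counts = Counter()
    for ts, code in events:
        counts[(bucket_start(ts, span), code)] += 1
    return counts

# Made-up (timestamp, HTTP status code) pairs.
events = [
    (datetime(2017, 7, 11, 9, 1, tzinfo=timezone.utc), 200),
    (datetime(2017, 7, 11, 9, 7, tzinfo=timezone.utc), 200),
    (datetime(2017, 7, 11, 9, 14, tzinfo=timezone.utc), 404),
    (datetime(2017, 7, 11, 9, 16, tzinfo=timezone.utc), 200),
    (datetime(2017, 7, 11, 9, 29, tzinfo=timezone.utc), 500),
]

counts = count_by_code(events)
for (start, code), n in sorted(counts.items()):
    print(start.isoformat(), code, n)
```

Each resulting count is one point in the series for that status code, and a model of normal behavior can then be built over many such buckets.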

Metrics for memory utilization or duration counters are another example of regularly occurring key indicators. These numerical fields can be averaged over a period of time so that the machine learning algorithms can model them. It is best to use indicators that stay fairly constant from one time period to the next, so it is important not to make the time period too small, where the indicators could change significantly from one interval to the next.
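
The effect of the time period on an averaged metric can be seen in a small Python sketch. It averages a noisy, made-up memory utilization series over buckets of two different sizes and compares how much the bucket averages vary. The sample values and the bucket sizes of 1 and 15 samples are assumptions for illustration only.

```python
from statistics import mean, pstdev

def bucket_means(samples, per_bucket):
    """Average consecutive samples in groups of per_bucket (full groups only)."""
    last_start = len(samples) - per_bucket + 1
    return [mean(samples[i:i + per_bucket]) for i in range(0, last_start, per_bucket)]

# Made-up memory utilization (%), one sample per minute, noisy around 60.
samples = [60 + (7 if i % 2 else -7) + (i % 3) for i in range(120)]

small = bucket_means(samples, 1)   # 1 minute buckets: every sample is its own bucket
large = bucket_means(samples, 15)  # 15 minute buckets

spread_small = pstdev(small)
spread_large = pstdev(large)

print("buckets:", len(small), "spread of 1 minute averages:", round(spread_small, 2))
print("buckets:", len(large), "spread of 15 minute averages:", round(spread_large, 2))
```

With the larger bucket, the averages vary far less from one period to the next, which is the kind of steadier indicator described above. The trade-off is that very large buckets can also smooth away short-lived anomalies.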

How to get started with machine learning on IT Operations data

Elastic’s machine learning comes as a part of X-Pack. It’s easy to get a trial up and running quickly with some operations data. The machine learning option is available on Elastic Stack starting with version 5.4 with X-Pack installed. Once you have a system running with X-Pack the machine learning options are accessible through the Kibana web interface.

If there is already data in Elasticsearch and Kibana, it is easy to get started with the machine learning feature. For best results, we recommend having at least three weeks’ worth of data before running machine learning jobs on it so that the software can learn and model what is normal.

Here is some additional content to help any IT Operations team learn more about Elastic’s machine learning:

Download a free trial of X-Pack and try it out.

Get a full product tour in the webinar.

Try Elastic’s machine learning video series:

Take the X-Pack online machine learning training (Free for a limited time)

Try additional IT operations machine learning recipes: