September 26, 2017

Machine Learning for Nginx Logs - Identifying Operational Issues with Your Website

Editor's Note (August 3, 2021): This post uses deprecated features. Please reference the map custom regions with reverse geocoding documentation for current instructions.

Getting insight from nginx log files can be complicated. This blog shows how machine learning can be used to automatically extract operational insights from large volumes of nginx log data.

Overview

Data science can be a complicated, experimental process where it is easy to get lost in the data, or the counter-intuitiveness of statistics. Therefore, a key design goal for the Machine Learning group at Elastic is to develop tools that empower a wide spectrum of users to get insight out of Elasticsearch data.

This lead to us to develop features such as "Single Metric Job" and "Multiple Metric Job" wizards in X-Pack Machine Learning, and we are planning to simplify analysis and configuration steps even more in upcoming releases.

In parallel to these wizards, we are also planning to shrink-wrap job configurations on known Beats and Logstash data sources. For example, if you are collecting data with the Filebeat NGINX module, we can provide a set of shrink-wrapped configurations and dashboards to help users apply machine learning to their data. These configurations are also aimed at showing how we develop Machine Learning configurations internally based on our experience.

The details of how to install these configurations will be covered in a subsequent blog. This blog is aimed at describing the use cases and configurations.

Use Case Notes

The configuration options for X-Pack Machine Learning are extensive, and often new users are tempted to start with complex configurations and select large numbers of attributes and series. These types of configurations can be very powerful and expressive, but require care as the results can be difficult to interpret. We therefore recommend that users start with simple, well-defined use cases, and build out complexity as they become more familiar with the system. (Note, often the best initial use cases come from automating anomaly detection on charts on the Operations teams core dashboards.)

Example Data Description

The data used in these examples is from a production system consisting of 4 load balanced nginx web servers. We analysed 3 months data (~29,000,000 events, ~1,100,000 unique visitors, ~29GB data). Note, the data shown here has been anonymised.

nginx log format:

'"$http_x_forwarded_for" $remote_addr - [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"';

Sample log message:

"2021:0eb8:86a3:1000:0000:9b3e:0370:7334 10.225.192.17 10.2.2.121" - - [30/Dec/2016:06:47:09 +0000] "GET /test.html HTTP/1.1" 404 8571 "-" "Mozilla/5.0 (compatible; Facebot 1.0; https://developers.facebook.com/docs/sharing/webmasters/crawler)"

Once processed by Filebeat's NGINX module configuration, we get the following JSON document in Elasticsearch:

... { "nginx" : { "access" : { "referrer" : "-", "response_code" : "404", "remote_ip" : "2021:0eb8:86a3:1000:0000:9b3e:0370:7334", "geoip" : { "continent_name" : "Europe", "country_iso_code" : "PT", "location" : { "lon" : -10.23057, "lat" : 34.7245 } }, "method" : "GET", "user_name" : "-", "http_version" : "1.1", "body_sent" : { "bytes" : "8571" }, "remote_ip_list" : [ "2021:0eb8:86a3:1000:0000:9b3e:0370:7334", "10.225.192.17", "10.2.2.121" ], "url" : "/test.html", "user_agent" : { "major" : "1", "minor" : "0", "os" : "Other", "name" : "Facebot", "os_name" : "Other", "device" : "Spider" } } } }...

Use Case 1: Changes in Website Visitors

Operationally, system issues are often reflected in changes in visitor rate. For example, if the visitor rate declines significantly in a short period of time, it is likely that there is a system issue with the site. Simple ways to understand changes in visitor rate are to analyse overall event rate, or the rate number of distinct visitors.

Job 1.1: Low Count of Website Visitors

This job can simply be configured using the 'Single Metric Job' wizard: Low Count of Website Visitors

Job configuration summary:

Low Count of Website Visitors Configuration

This analysis shows a significant anomaly on February 27th where the total event rate drops significantly:

Low Count of Website Visitors Anomalies

(Note this analysis of the 29,000,000 events took a total of 16s on a m4.large AWS instance)

Job 1.2: Low Count of Unique Website Visitors

Event counts can be strongly influenced by bots or attackers, and so a more consistent feature to analyse the number of unique website visitors. Again this can simply be configured using the 'Single Metric Job' wizard:

Low Count of Unique Website Visitors

Again there is a significant anomaly on February 27th where the number of unique visitors per 15m drops from a typical 1487 to 86:

Low Count of Unique Website Visitors Anomalies

Combining Job 1.1 and 1.2:

Using the Anomaly Explorer the results from both jobs can be temporary correlated to give an 'Overall' view into the anomalousness of the system based on these features:

Anomaly Explorer

This clearly shows in a single view, that there was a significant anomaly on February 27th between 10:00-12:00 where the total event rate dropped, and the number of unique visitors dropped.

The operations team confirmed the site had significant issues at this time due to a prior configuration change in the CDN. Unfortunately, they didn't detect the user impact until 11:30 (due to internal users on Slack complaining), whereas with ML they would have been alerted at 10:00 when the issue occurred.

This analysis can be combined with alerting to give operations teams early insights into changes in system behaviour.

Use Case 2: Changes in Website Behaviour

Once simple behaviours are analysed, next steps are often to analyse more complex features. For example, changes in event rates of the different HTTP status codes returned by the webserver can often indicate changes in system behaviour or unusual clients:

HTTP status codes over time

This use case is more complex as it involves analysing multiple series concurrently, but it can again be simply configured using the "Multiple Metric Job" wizard:

HTTP status codes analysis

Results show some significant changes in the different response codes:

HTTP status codes anomalies

In particular, again on February 27th there is a significant change in behaviour of response_code 404, 301, 306 and 200. Zooming in on 404s show some significant anomalies:

HTTP status codes anomalies

The first highlighted anomaly is attributed to a specific IP address as nginx.access.remote_ip is defined as an influencer (more on this in a later blog). The second highlighted anomaly represents a significant overall change in 404 behaviour.

The increase in 404s on February 27th was again a new insight for the operations team, and represented a large number of dead links that had been introduced by the configuration change.

Use Case 3: Unusual Clients

Website traffic generally consists of a combination of normal usage, scanning by bots and attempted malicious activity. Assuming the majority of clients are normal, we can use population analysis to detect significant attacks or bot activity.

The number of pages a normal user requests in a 5-minute window can be limited by how fast they can manually click website pages. Automated processes can scan 1000s of pages a minute, and attackers can simply flood a site with requests.

There are a number of features we could use to differentiate traffic types, but in the first instance, event rate and number of distinct URL rate by a client can highlight unusual client activity.

In this case, advanced job configuration is used to configure 2 population jobs:

Population analysis configuration