February 6, 2015

Spotting Bad Actors: What Your Logs Can Tell You about Protecting Your Business

By Mark Harwood

In this blog post, we'll use Elasticsearch's aggregations to analyze web server log files, with the goal of discovering how to block unwelcome visitors to a site.

This is a responsibility every webmaster has, and the fundamental choice is to either:

  • ban a specific IP address (e.g. 121.205.248.124); or
  • ban an entire subnet (e.g. 121.205)

It may be more convenient for a webmaster to block a whole range of IP addresses grouped under a single subnet. However, this may be over-zealous, as a wealth of well-behaved visitors may now be blocked along with the bad guys. How do webmasters understand the mix of good vs bad traffic at each level to make a good business decision?

Elasticsearch to the rescue!

We'll explore how the netrisk plugin runs Elasticsearch queries to find sources of bad behavior and uses a "Sankey" flow diagram (see below) to illustrate the size and concentrations of risk at various points in a site's traffic flow.

[Figure: Sankey flow diagram of risky traffic, as drawn by the netrisk plugin]

Setting up the example

To follow along with this demonstration, you will first need to install the netrisk plugin by running the following command in your Elasticsearch (1.4.0 or later) home directory:

bin/plugin -install markharwood/netrisk

All being well, the plugin should be installed. Before we can use it, however, we must provide some data with the appropriate configuration. We'll examine the details of the required index mapping later, but, for now, you can index some anonymized test data by running the shell script in this directory:

$ES_HOME/plugins/netrisk/exampleData/indexAnonData.sh

This will create an index called "mylogs" with some data from real log records which contain an anonymized IP address and an HTTP response code.

Caution - This will delete any existing index called "mylogs".

Finally you can launch the plugin using the URL http://localhost:9200/_plugin/netrisk/

Running the example

The example data has a limited set of attributes we can use to identify risky traffic. We only have HTTP response status codes, but these are sufficient to find some bad behaviour in this data. The 200/300 range of HTTP response codes represents typical site traffic, whereas the 400/500 ranges represent failures, e.g. attempts to access non-existent pages. The query to single out those requests in our data is as follows:

status:[400 TO 599]

The netrisk plugin uses the standard Lucene query parser (the same one used by Kibana), so your query could use ORs to look for other features in your data that might suggest risky traffic, e.g. requests missing a UserAgent. Our query does not need to be too certain in determining what is "bad" - we just need to suggest a sense of what might have a bad smell about it, and then the aggregations framework will do the rest in finding sources with high concentrations of this type of content, undiluted by anything else.

If we run the above search, the Sankey diagram should appear showing trails of the riskier traffic flowing into our website. Various pieces of information are summarized in the diagram:

  • Line thickness represents the volume of "bad" requests made, but volume is not everything here!
  • More importantly, the choice of line color represents the mix of good vs bad requests from a source with pure red being wholly bad and green being mostly good. (All points in the diagram have some degree of badness). Mixtures of red and green are a brownish color and represent a mixed profile of behavior. Hovering over a line reveals the actual numbers behind the coloring.
  • Each subnet in the diagram includes a count of the IP addresses that have been observed under that subnet. This can indicate the number of addresses affected if a webmaster chooses to ban this entire subnet in any firewall changes.
  • Clicking on a full IP address will take you to the "project honeypot" website to review any comments from webmasters about this address and how it may have been misbehaving elsewhere.
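
How a line's color could follow from the underlying counts can be sketched roughly as below. This is only an illustration: the function name `risk_color` and the linear red/green blend are assumptions, not the plugin's actual rendering code.

```python
def risk_color(bad: int, total: int) -> str:
    """Blend from green (no bad requests) to red (all bad) as a hex color.

    Hypothetical helper: the linear blend is an assumption made for
    illustration and may differ from what the netrisk plugin draws.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    ratio = bad / total                  # share of bad requests from this source
    red = round(255 * ratio)             # more bad traffic -> more red
    green = round(255 * (1 - ratio))     # more good traffic -> more green
    return f"#{red:02x}{green:02x}00"


print(risk_color(10, 10))   # wholly bad source
print(risk_color(1, 4))     # mostly good source
```

With a blend like this, an even mix of good and bad requests lands on an olive/brown tone, which matches the "brownish" lines described above.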

You'll notice that the lines in the diagram tend to change from red to green as they move from left to right through the various subnet levels. This is because each node on the left-hand side of the diagram typically represents a small number of users who are dedicated to bad behavior. Each stage to the right represents a subnet covering a larger number of users, who will typically dilute the good/bad mix with added volumes of well-behaved users who make up the normal access patterns. However, some subnets may represent entire countries where the mix of traffic stays in the red because your site may not have any relevance to that region, and the only people interested in visiting your site from there are miscreants.
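
The dilution effect can be illustrated with a toy calculation. The sketch below rolls up a handful of made-up log records (the addresses and status codes are invented for illustration) by each prefix of the IP address and reports the share of bad requests at each level. The helper names are hypothetical.

```python
from collections import defaultdict

# Invented sample records: (remote_host, HTTP status)
RECORDS = [
    ("10.1.1.5", 404), ("10.1.1.5", 404), ("10.1.1.5", 500),
    ("10.1.1.9", 200), ("10.1.2.7", 200), ("10.1.2.7", 200),
    ("10.2.3.4", 200), ("10.2.3.4", 301), ("10.2.8.1", 200),
]


def is_bad(status: int) -> bool:
    """Same inclusive range as the status:[400 TO 599] query."""
    return 400 <= status <= 599


def bad_share_by_prefix(records):
    """Return {prefix: (bad, total)} for every prefix of every address."""
    counts = defaultdict(lambda: [0, 0])
    for host, status in records:
        parts = host.split(".")
        for depth in range(1, len(parts) + 1):
            prefix = ".".join(parts[:depth])
            counts[prefix][1] += 1          # every record counts toward total
            if is_bad(status):
                counts[prefix][0] += 1      # only failures count as bad
    return {prefix: (bad, total) for prefix, (bad, total) in counts.items()}


shares = bad_share_by_prefix(RECORDS)
for prefix in ("10.1.1.5", "10.1.1", "10.1", "10"):
    bad, total = shares[prefix]
    print(f"{prefix:<10} {bad}/{total} bad")
```

In this toy data the single address is wholly bad, while each wider prefix picks up more well-behaved traffic and the bad share falls.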

How does it work?

Preparing the data

This analysis relies on having statistics about the frequencies of both IP addresses and subnets encoded in the index. To do this analysis, we can't just index each IP address as a single string; we must break it up into multiple tokens (e.g. the IPv4 address 186.28.25.186 is indexed as tokens 186.28.25.186, 186.28.25, 186.28 and 186). This is done using the following mapping definition:

    curl -XPOST "http://localhost:9200/mylogs" -d '
    {
       "settings": {
          "analysis": {
             "analyzer": {
                "ip4_analyzer": {
                   "tokenizer": "ip4_hierarchy"
                }
             },
             "tokenizer": {
                "ip4_hierarchy" : {
                   "type" : "PathHierarchy",
                   "delimiter" : "."
                }
             }
          }
       },
       "mappings": {
          "log": {
             "properties": {
                "remote_host": {
                   "type": "string",
                   "index": "not_analyzed",
                   "fields": {
                      "subs": {
                         "type": "string",
                         "index_analyzer": "ip4_analyzer",
                         "search_analyzer": "keyword"
                      }
                   }
                }
             }
          }
       }
    }'

This gives us the power to quickly look up the number of logged events at 4 levels of hierarchy for each IP address. For IPv6 addresses, the delimiter used would be a ":" and some additional logic would be required to deal with the "::" syntax that is shorthand for zeroes.
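
To make the tokenization concrete, here is a small Python sketch that emulates the tokens a path hierarchy tokenizer produces for an IPv4 address, and shows one possible way to expand the "::" shorthand in an IPv6 address before indexing. The helper names are hypothetical, and using Python's standard `ipaddress` module for the expansion is an assumption about how such preprocessing could be done, not something the plugin does.

```python
import ipaddress
from typing import List


def hierarchy_tokens(address: str, delimiter: str = ".") -> List[str]:
    """Emulate the path hierarchy tokens: one token per prefix of the address."""
    parts = address.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]


def expand_ipv6(address: str) -> str:
    """Write an IPv6 address out in full, replacing '::' with explicit zero groups."""
    return ipaddress.IPv6Address(address).exploded


print(hierarchy_tokens("186.28.25.186"))
# ['186', '186.28', '186.28.25', '186.28.25.186']

print(hierarchy_tokens(expand_ipv6("2001:db8::1"), delimiter=":")[:3])
# ['2001', '2001:0db8', '2001:0db8:0000']
```

Expanding the address first means every IPv6 address yields the same number of groups, so a given prefix always refers to the same level of the hierarchy.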

The queries behind the netrisk tool

The netrisk tool takes your choice of query which identifies "bad" (or perhaps more accurately, "potentially bad") and uses an aggregation called the "significant_terms" aggregation to examine which IP addresses or subnets are disproportionately represented in the set of bad requests. We here at Elasticsearch call this the "uncommonly common". The template looks like this:

    curl -XGET "http://localhost:9200/mylogs/_search?search_type=count" -d'
    {
       "query": {
          "query_string": {
             "query": "status:[400 TO 599]"
          }
       },
       "aggs": {
          "sigIps": {
             "significant_terms": {
                "field": "remote_host.subs",
                "size": 50,
                "shard_size": 50000,
                "gnd": {}
             }
          }
       }
    }'

This query will pick the 50 highest-risk IP addresses or subnets in your index. The points of interest here are:

  • We need a high shard_size setting to join up the stats across potentially many indexes/shards. This will incur memory and network costs and require a lot of disk look-ups for the many unique terms. If we didn't index the full IP address in the field remote_host.subs, this would reduce the number of unique terms considered here but produce results that only went down to a certain level of resolution.
  • The "GND" scoring heuristic is more suited to this task than the default JLH scoring approach. Ordinarily, we use significant terms with the JLH heuristic to favor rare words (e.g. that the terms "Nosferatu" and "Helsing" are significantly correlated with the set of docs that are Dracula movies, rather than the less insightful observation that the common word "he" is similarly increased in popularity). In this IP analysis task, we do want to identify some of the more common terms that represent popular subnets, and the GND scoring heuristic places more emphasis on terms such as this (the ranking biases of the various heuristics are shown here).
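
The "uncommonly common" idea can be illustrated with a deliberately simplified calculation that compares a term's share of the "bad" documents (the foreground) with its share of the whole index (the background). The plain ratio used here is an assumption chosen for readability; it is neither the JLH nor the GND heuristic, and all of the counts are invented.

```python
def lift(fg_count: int, fg_size: int, bg_count: int, bg_size: int) -> float:
    """How many times more common a term is in the foreground than overall."""
    fg_share = fg_count / fg_size    # share of bad docs containing the term
    bg_share = bg_count / bg_size    # share of all docs containing the term
    return fg_share / bg_share


BAD_DOCS = 10_000        # invented: docs matching status:[400 TO 599]
ALL_DOCS = 1_000_000     # invented: docs in the whole index

# term -> (count among bad docs, count among all docs); invented numbers
TERMS = {
    "186.28.25.186": (400, 450),
    "186.28": (900, 20_000),
    "186": (1_500, 150_000),
}

for term, (fg, bg) in TERMS.items():
    print(f"{term:<15} lift = {lift(fg, BAD_DOCS, bg, ALL_DOCS):.1f}")
```

A lift close to 1 means the term is no more common among bad requests than anywhere else, so it carries no signal however many bad requests it accounts for.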

This single query will do the bulk of the analysis and provide us with the main offenders in our system, but we need to address some potential inaccuracies before inviting webmasters to block a particular subnet or IP address. What if our "bad" query identified some matches for an IP address on one shard but failed to match a single record in another shard holding a multitude of only "good" records? If the IP was a dynamic one and was reallocated, it is plausible that this sort of discrepancy between shards might happen when using time-based indices. The shard with only good records would fail to return any stats on this benign behaviour as part of the initial risk assessment. To verify our stats, a subsequent query is required to gather all of the stats from all of the shards for our assumed high-risk selections. The query to do this looks as follows:

    {
        "query":{
            "terms":{"remote_host.subs":["256.52","186.34.56" ...]}
        },
        "aggs": {
            "ips" : {
                      "filters" : {
                        "filters" : {
                            "256.52":{ "term" : { "remote_host.subs" : "256.52"   }},
                            "186.34.56":{ "term" : { "remote_host.subs" : "186.34.56"   }}
                            ...
                        }
                      },
                "aggs": {
                    "badTraffic": {
                        "filter": {
                            "query":{"query_string": {"query": "status:[400 TO 599]"}
                            }
                        }
                    },
                    "uniqueIps": {
                        "cardinality": {
                            "field":"remote_host"
                        }
                    }
                }
            }
        }
    }
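
The "..." placeholders in the template above stand for the remaining high-risk terms. A minimal sketch of filling in that template programmatically might look like the following; the function name `build_verification_query` is hypothetical, and the structure simply mirrors the request body shown above.

```python
import json


def build_verification_query(terms, bad_query="status:[400 TO 599]"):
    """Build a verification request body for the given IPs/subnets."""
    terms = list(terms)
    return {
        # Restrict the search to documents from the suspect IPs/subnets.
        "query": {"terms": {"remote_host.subs": terms}},
        "aggs": {
            "ips": {
                # One named bucket per suspect term.
                "filters": {
                    "filters": {
                        term: {"term": {"remote_host.subs": term}}
                        for term in terms
                    }
                },
                "aggs": {
                    # Count of bad records inside each bucket.
                    "badTraffic": {
                        "filter": {
                            "query": {"query_string": {"query": bad_query}}
                        }
                    },
                    # Number of distinct full IP addresses inside each bucket.
                    "uniqueIps": {"cardinality": {"field": "remote_host"}},
                },
            }
        },
    }


body = build_verification_query(["256.52", "186.34.56"])
print(json.dumps(body, indent=2))
```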

For each of the perceived high-risk IPs/subnets we create a bucket and for ALL shards count:

  1. The total number of log records (good and bad)
  2. The total number of bad log records
  3. The total number of unique IP addresses that share this subnet

Using the above we can now accurately color and size the selected nodes in our diagram.
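
As a rough sketch of that last step, the three counts can be read out of the verification response and turned into a bad-traffic ratio per node. The response shape below assumes the usual keyed form of a named "filters" aggregation, the numbers are invented, and the `summarize` helper is hypothetical rather than the plugin's own code.

```python
# Invented, trimmed-down response for the verification query.
SAMPLE_RESPONSE = {
    "aggregations": {
        "ips": {
            "buckets": {
                "256.52": {
                    "doc_count": 2000,
                    "badTraffic": {"doc_count": 500},
                    "uniqueIps": {"value": 120},
                },
                "186.34.56": {
                    "doc_count": 80,
                    "badTraffic": {"doc_count": 76},
                    "uniqueIps": {"value": 3},
                },
            }
        }
    }
}


def summarize(response):
    """Return one row per bucket, sorted with the worst bad ratio first."""
    rows = []
    for term, bucket in response["aggregations"]["ips"]["buckets"].items():
        total = bucket["doc_count"]               # all records, good and bad
        bad = bucket["badTraffic"]["doc_count"]   # records matching the bad query
        rows.append({
            "term": term,
            "total": total,
            "bad": bad,
            "bad_ratio": bad / total if total else 0.0,
            "unique_ips": bucket["uniqueIps"]["value"],
        })
    return sorted(rows, key=lambda row: row["bad_ratio"], reverse=True)


rows = summarize(SAMPLE_RESPONSE)
for row in rows:
    print(row)
```

The bad count would drive line thickness, the bad ratio would drive color, and the unique IP count is the figure shown against each subnet.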

Conclusion

Tracking behaviors of entities like IP addresses across multiple log records and shards is a tough computation problem. The analysis performed by this plugin is at the upper end of what you might attempt at scale using a typical time-based index of log records.

Below are some examples of even more challenging forms of behavioral analysis:

  1. How long do my site visitors spend on average on my site?
  2. Which IP addresses behave like bots (never requesting CSS or JavaScript, only web pages)?
  3. What is the first/last web page users typically visit on my site?

To attempt these forms of analysis at scale, we need to fuse related log records into summaries of an entity's behavior over time. Thankfully, there is a way of doing this, and it is the subject of the "entity-centric indexing" talk I will be giving at Elastic{ON}.

If you are coming to Elastic{ON}, come join in on the fun on the afternoon of Tuesday, March 10th. If not, stay tuned for updates on Twitter about when videos of the session may be ready. Either way, have a great weekend and may your logs be ever in your favor.