Identifying Data for Analysis

For the purposes of this tutorial, we provide sample data that you can play with and search in Elasticsearch. When you consider your own data, however, it’s important to take a moment and think about where the X-Pack machine learning features will be most impactful.

The first consideration is that it must be time series data. The machine learning features are designed to model and detect anomalies in time series data.

The second consideration, especially when you are first learning to use machine learning, is the importance of the data and how familiar you are with it. Ideally, it is information that contains key performance indicators (KPIs) for the health, security, or success of your business or system. It is information that you need to monitor and act on when anomalous behavior occurs. You might even have Kibana dashboards that you’re already using to watch this data. The better you know the data, the quicker you will be able to create machine learning jobs that generate useful insights.

The final consideration is where the data is located. This tutorial assumes that your data is stored in Elasticsearch. It guides you through the steps required to create a datafeed that passes data to a job. If your own data is outside of Elasticsearch, analysis is still possible by using a post data API.
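For reference, the post data API sends JSON documents directly to an existing job over HTTP. The following is a minimal sketch only: it assumes a job with the hypothetical ID my_job already exists and that my-data.json contains the documents to analyze, and the endpoint path shown is the X-Pack-era one, so check the API reference for your version:

curl -u elastic:changeme -X POST -H 'Content-Type: application/json' \
http://localhost:9200/_xpack/ml/anomaly_detectors/my_job/_data --data-binary "@my-data.json"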

Important

If you want to create machine learning jobs in Kibana, you must use datafeeds. That is to say, you must store your input data in Elasticsearch. When you create a job, you select an existing index pattern and Kibana configures the datafeed for you under the covers.

Obtaining a Sample Data Set

In this step, you upload sample data to Elasticsearch. This uses standard Elasticsearch functionality and sets the stage for using machine learning.

The sample data for this tutorial contains information about the requests that are received by various applications and services in a system. A system administrator might use this type of information to track the total number of requests across all of the infrastructure. If the number of requests increases or decreases unexpectedly, for example, this might be an indication that there is a problem or that resources need to be redistributed. By using the X-Pack machine learning features to model the behavior of this data, it is easier to identify anomalies and take appropriate action.

Download this sample data by clicking here: server_metrics.tar.gz

Use the following command to extract the files:

tar -zxvf server_metrics.tar.gz
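
After extraction, the working directory should contain the upload script and the four JSON data files that are used in the rest of this tutorial, similar to the following listing:

ls
server-metrics_1.json  server-metrics_2.json  server-metrics_3.json  server-metrics_4.json  upload_server-metrics.sh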

Each document in the server-metrics data set has the following schema. The first object is the bulk API action metadata; the second is the document source:

{
  "index":
  {
    "_index":"server-metrics",
    "_type":"metric",
    "_id":"1177"
  }
}
{
  "@timestamp":"2017-03-23T13:00:00",
  "accept":36320,
  "deny":4156,
  "host":"server_2",
  "response":2.4558210155,
  "service":"app_3",
  "total":40476
}
Tip

The sample data sets include summarized data. For example, the total value is a sum of the requests that were received by a specific service at a particular time. If your data is stored in Elasticsearch, you can generate this type of sum or average by using aggregations. One of the benefits of summarizing data this way is that Elasticsearch automatically distributes these calculations across your cluster. You can then feed this summarized data into X-Pack machine learning instead of raw results, which reduces the volume of data that must be considered while detecting anomalies. For the purposes of this tutorial, however, these summary values are stored in Elasticsearch. For more information, see Aggregating Data For Faster Performance.
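
For illustration, once the sample data is loaded (later in this step), a sum aggregation inside a date_histogram like the following would produce hourly totals of the total field. The bucket names and the one-hour interval are illustrative choices, and the date_histogram interval parameter name can differ between Elasticsearch versions:

curl -u elastic:changeme -X POST -H 'Content-Type: application/json' \
http://localhost:9200/server-metrics/_search -d '{
   "size":0,
   "aggs":{
      "requests_per_hour":{
         "date_histogram":{
            "field":"@timestamp",
            "interval":"1h"
         },
         "aggs":{
            "sum_total":{
               "sum":{
                  "field":"total"
               }
            }
         }
      }
   }
}'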

Before you load the data set, you need to set up mappings for the fields. Mappings divide the documents in the index into logical groups and specify each field's characteristics, such as whether the field is searchable and whether it is tokenized (broken up into separate words).

The sample data includes an upload_server-metrics.sh script, which you can use to create the mappings and load the data set. You can download it by clicking here: upload_server-metrics.sh. Before you run it, however, you must edit the USERNAME and PASSWORD variables with your actual user ID and password.

The script runs a command similar to the following example, which sets up a mapping for the data set:

curl -u elastic:changeme -X PUT -H 'Content-Type: application/json' \
http://localhost:9200/server-metrics -d '{
   "settings":{
      "number_of_shards":1,
      "number_of_replicas":0
   },
   "mappings":{
      "metric":{
         "properties":{
            "@timestamp":{
               "type":"date"
            },
            "accept":{
               "type":"long"
            },
            "deny":{
               "type":"long"
            },
            "host":{
               "type":"keyword"
            },
            "response":{
               "type":"float"
            },
            "service":{
               "type":"keyword"
            },
            "total":{
               "type":"long"
            }
         }
      }
   }
}'
Note

If you run this command, you must replace changeme with your actual password.
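
If you want to confirm that the index and its mappings were created as expected, you can retrieve the mapping directly. This quick check is not part of the script:

curl -u elastic:changeme 'http://localhost:9200/server-metrics/_mapping?pretty'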

You can then use the Elasticsearch bulk API to load the data set. The upload_server-metrics.sh script runs commands similar to the following example, which loads the four JSON files:

curl -u elastic:changeme -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_1.json"

curl -u elastic:changeme -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_2.json"

curl -u elastic:changeme -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_3.json"

curl -u elastic:changeme -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_4.json"
Tip

This uploads about 200 MB of data. The data set is split into four files because there is a 100 MB limit on the size of a single _bulk request.

These commands might take some time to run, depending on the computing resources available.

You can verify that the data was loaded successfully with the following command:

curl 'http://localhost:9200/_cat/indices?v' -u elastic:changeme

You should see output similar to the following:

health status index          ... pri rep docs.count ...
green  open   server-metrics ... 1   0   905940     ...
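
As an additional sanity check, you can retrieve a single document and confirm that its fields match the schema shown earlier. For example:

curl -u elastic:changeme 'http://localhost:9200/server-metrics/_search?size=1&pretty'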

Next, you must define an index pattern for this data set:

  1. Open Kibana in your web browser and log in. If you are running Kibana locally, go to http://localhost:5601/.
  2. Click the Management tab, then Index Patterns.
  3. If you already have index patterns, click the plus sign (+) to define a new one. Otherwise, the Configure an index pattern wizard is already open.
  4. For this tutorial, any pattern that matches the name of the index you’ve loaded will work. For example, enter server-metrics* as the index pattern.
  5. Verify that the Index contains time-based events is checked.
  6. Select the @timestamp field from the Time-field name list.
  7. Click Create.

This data set can now be analyzed in machine learning jobs in Kibana.