Identifying Data for Analysis
For the purposes of this tutorial, we provide sample data that you can play with and search in Elasticsearch. When you consider your own data, however, it’s important to take a moment and think about where the machine learning features will be most impactful.
The first consideration is that it must be time series data. The machine learning features are designed to model and detect anomalies in time series data.
The second consideration, especially when you are first learning to use machine learning, is the importance of the data and how familiar you are with it. Ideally, it is information that contains key performance indicators (KPIs) for the health, security, or success of your business or system. It is information that you need to monitor and act on when anomalous behavior occurs. You might even have Kibana dashboards that you’re already using to watch this data. The better you know the data, the quicker you will be able to create machine learning jobs that generate useful insights.
The final consideration is where the data is located. This tutorial assumes that your data is stored in Elasticsearch. It guides you through the steps required to create a datafeed that passes data to a job. If your own data is outside of Elasticsearch, analysis is still possible by using the post data API.
If you want to create machine learning jobs in Kibana, you must use datafeeds. That is to say, you must store your input data in Elasticsearch. When you create a job, you select an existing index pattern and Kibana configures the datafeed for you under the covers.
Obtaining a sample data set
In this step we will upload some sample data to Elasticsearch. This is standard Elasticsearch functionality, and is needed to set the stage for using machine learning.
The sample data for this tutorial contains information about the requests that are received by various applications and services in a system. A system administrator might use this type of information to track the total number of requests across all of the infrastructure. If the number of requests increases or decreases unexpectedly, for example, this might be an indication that there is a problem or that resources need to be redistributed. By using the machine learning features to model the behavior of this data, it is easier to identify anomalies and take appropriate action.
Download the sample data and scripts: server_metrics.tar.gz
Use the following command to extract the files:
tar -zxvf server_metrics.tar.gz
Each document in the server-metrics data set has the following schema:
{ "index": { "_index":"server-metrics", "_type":"metric", "_id":"1177" } }
{ "@timestamp":"2017-03-23T13:00:00", "accept":36320, "deny":4156, "host":"server_2", "response":2.4558210155, "service":"app_3", "total":40476 }
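To make the schema concrete, the following Python sketch parses the action object and the document object from the example above and reads back a few fields. The helper name `parse_bulk_pair` is hypothetical. The check that `accept` plus `deny` equals `total` is an observation about this one sample document, not a documented guarantee for the whole data set.

```python
import json

# The action object and the document object, copied from the example above.
ACTION_LINE = '{ "index": { "_index":"server-metrics", "_type":"metric", "_id":"1177" } }'
SOURCE_LINE = (
    '{ "@timestamp":"2017-03-23T13:00:00", "accept":36320, "deny":4156, '
    '"host":"server_2", "response":2.4558210155, "service":"app_3", "total":40476 }'
)


def parse_bulk_pair(action_line, source_line):
    """Parse one action/source pair and return (index name, document id, source dict)."""
    action = json.loads(action_line)["index"]
    source = json.loads(source_line)
    return action["_index"], action["_id"], source


index_name, doc_id, doc = parse_bulk_pair(ACTION_LINE, SOURCE_LINE)
print(index_name, doc_id, doc["service"], doc["total"])

# True for this sample document; assuming it holds for every document is a guess.
print(doc["accept"] + doc["deny"] == doc["total"])
```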
The sample data sets include summarized data. For example, the total
value is a sum of the requests that were received by a specific service at a
particular time. If your data is stored in Elasticsearch, you can generate
this type of sum or average by using aggregations. One of the benefits of
summarizing data this way is that Elasticsearch automatically distributes
these calculations across your cluster. You can then feed this summarized data
into machine learning instead of raw results, which reduces the volume
of data that must be considered while detecting anomalies. For the purposes of
this tutorial, however, these summary values are stored in Elasticsearch. For more
information, see Aggregating data for faster performance.
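As an illustration of the kind of aggregation described above, the following Python sketch builds a search request body that sums `total` per service in fixed time buckets. The function name `build_sum_aggregation`, the default `10m` bucket size, and the aggregation names `buckets`, `by_service`, and `total_requests` are made up for this example. The `interval` parameter assumes an Elasticsearch version that still accepts it on `date_histogram`; newer releases use `fixed_interval` instead.

```python
import json


def build_sum_aggregation(interval="10m", time_field="@timestamp"):
    """Build a search body that sums `total` per time bucket and service."""
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "aggs": {
            "buckets": {
                # Assumption: older-style `interval`; swap in `fixed_interval` if needed.
                "date_histogram": {"field": time_field, "interval": interval},
                "aggs": {
                    "by_service": {
                        "terms": {"field": "service"},
                        "aggs": {"total_requests": {"sum": {"field": "total"}}},
                    }
                },
            }
        },
    }


# Print the body so it can be sent with a search request against server-metrics.
print(json.dumps(build_sum_aggregation(), indent=2))
```

Because `service` is mapped as a keyword in this tutorial, it can be used directly in the `terms` aggregation.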
Before you load the data set, you need to set up mappings for the fields. Mappings divide the documents in the index into logical groups and specify a field’s characteristics, such as the field’s searchability or whether or not it’s tokenized, or broken up into separate words.
You can use scripts to create the mappings and load the data set. If the Elasticsearch security features are enabled, use the upload_server_metrics.sh script. Before you run it, however, you must edit the USERNAME and PASSWORD variables with your actual user ID and password. If the Elasticsearch security features are not enabled, use the upload_server_metrics_noauth.sh script instead.
The scripts run a curl command that makes the following create index API request:
PUT server-metrics
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "metric": {
      "properties": {
        "@timestamp": { "type": "date" },
        "accept": { "type": "long" },
        "deny": { "type": "long" },
        "host": { "type": "keyword" },
        "response": { "type": "float" },
        "service": { "type": "keyword" },
        "total": { "type": "long" }
      }
    }
  }
}
To learn more about mappings and data types, see Mapping.
You can then use the bulk API to load the sample data set. The scripts run commands similar to the following example:
curl -X POST -H "Content-Type: application/json" http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_1.json"
There are twenty data files. The commands might take some time to run, depending on the computing resources available.
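For readers who prefer not to use the provided scripts, the following Python sketch shows one way the same bulk requests could be issued with the standard library. It is not what the scripts do internally. The file naming scheme (`server-metrics_1.json` through `server-metrics_20.json`) is inferred from the single file name in the curl example and the count of twenty, and the function names are hypothetical. The sketch also assumes a local cluster without security features enabled.

```python
import urllib.request

BASE_URL = "http://localhost:9200"  # same host and port as the curl example
FILE_COUNT = 20  # the tutorial states that there are twenty data files


def data_file_names(count=FILE_COUNT):
    """Assumed naming scheme: server-metrics_1.json ... server-metrics_<count>.json."""
    return [f"server-metrics_{n}.json" for n in range(1, count + 1)]


def build_bulk_request(payload, base_url=BASE_URL):
    """Build (but do not send) a POST to the bulk endpoint of the server-metrics index."""
    return urllib.request.Request(
        f"{base_url}/server-metrics/_bulk",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def upload_all(base_url=BASE_URL):
    """Send every data file in turn; needs a running cluster and the extracted files."""
    for name in data_file_names():
        with open(name, "rb") as handle:
            request = build_bulk_request(handle.read(), base_url)
        with urllib.request.urlopen(request) as response:
            print(name, response.status)
```

Calling `upload_all()` from the directory that contains the extracted files would send the files one at a time, in order.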
You can verify that the data was loaded successfully by running the cat indices API:
GET _cat/indices?v
You should see output similar to the following:
health status index          ... pri rep docs.count ...
green  open   server-metrics ...   1   0     905940 ...
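If you want to script this check, the following Python sketch parses cat-style output and extracts the document count for one index. The function names are hypothetical, and the sample text is a trimmed version of the output above with the elided columns left out. Splitting on whitespace assumes that no column in a row is empty.

```python
def parse_cat_indices(text):
    """Turn whitespace-aligned cat output (with a header row) into a list of dicts."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def doc_count(text, index_name):
    """Return docs.count for the named index, or None if the index is not listed."""
    for row in parse_cat_indices(text):
        if row.get("index") == index_name:
            return int(row["docs.count"])
    return None


# Sample shaped like the output above, without the elided columns.
SAMPLE = """
health status index          pri rep docs.count
green  open   server-metrics   1   0     905940
"""

print(doc_count(SAMPLE, "server-metrics"))  # prints 905940
```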
Next, you must define an index pattern for this data set:
- Open Kibana in your web browser. If you are running Kibana locally, go to http://localhost:5601/.
- Click the Management tab, then Kibana > Index Patterns.
- If you already have index patterns, click Create Index to define a new one. Otherwise, the Create index pattern wizard is already open.
- For this tutorial, any pattern that matches the name of the index you’ve loaded will work. For example, enter server-metrics* as the index pattern.
- In the Configure settings step, select the @timestamp field in the Time Filter field name list.
- Click Create index pattern.
This data set can now be analyzed in machine learning jobs in Kibana.