Outlier detection

Outlier detection is an analysis for identifying data points (outliers) whose feature values are different from those of the normal data points in a particular data set. Outliers may denote errors or unusual behavior.

We use unsupervised outlier detection, which means there is no need to provide a training data set to teach outlier detection to recognize outliers. Unsupervised outlier detection uses various machine learning techniques to find which data points are unusual compared to the majority of the data points.

In the Elastic Stack, we use an ensemble of four different distance-based and density-based outlier detection methods. By default, you don’t need to select the methods or provide any parameters, but you can override the default behavior if you like. The basic assumption of the distance-based methods is that normal data points (in other words, points that are not outliers) have a lot of neighbors nearby, because we expect the majority of the data points in a population to have similar feature values. The minority of the data points (the outliers) have different feature values and will therefore be far away from the normal points.

The distance of Kth nearest neighbor method (distance_kth_nn) computes the distance of the data point to its Kth nearest neighbor, where K is a small number that is usually independent of the total number of data points. The higher this distance, the more the data point is an outlier.
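The idea can be sketched in a few lines of Python. This is only an illustration of the technique, assuming Euclidean distance and a brute-force neighbor search; the function name, the sample points, and the choice of K are hypothetical and do not reflect how the Elastic Stack implements the method.

```python
import math

def distance_kth_nn(points, k):
    """Return, for each point, the distance to its k-th nearest neighbor."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, sorted ascending.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        # The k-th nearest neighbor sits at index k - 1.
        scores.append(dists[k - 1])
    return scores

# Hypothetical data: four points on a unit square plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
kth_scores = distance_kth_nn(points, k=2)
most_outlying = max(range(len(points)), key=lambda i: kth_scores[i])
```

In this sample, the four clustered points get the same small score, while the distant point gets a much larger one and is ranked as the most outlying.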

The distance of K-nearest neighbors method (distance_knn) calculates the average distance of each data point to its K nearest neighbors. Points with the largest average distance are the most outlying.
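A minimal sketch of this averaging, under the same assumptions as any brute-force illustration (Euclidean distance, hypothetical function name and sample data, no claim about the actual Elastic implementation):

```python
import math

def distance_knn(points, k):
    """Return, for each point, the mean distance to its k nearest neighbors."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, sorted ascending.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        # Average over the k smallest distances.
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical data: four points on a unit square plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
knn_scores = distance_knn(points, k=2)
```

Averaging over K neighbors makes the score less sensitive to a single unusually close or distant neighbor than the Kth-neighbor distance alone.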

While the results of the distance-based methods are easy to interpret, their drawback is that they don’t take into account the density variations of a data set. Density-based methods mitigate this problem. They take into account not only the distance of the points to their K nearest neighbors but also the distance of these neighbors to their own neighbors.

Based on this approach, a metric called the local outlier factor (lof) is computed for each data point. The higher the local outlier factor, the more outlying the data point is.
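The text above does not spell out the formula, so the following sketch uses the commonly cited textbook formulation of the local outlier factor (reachability distance and local reachability density). That choice is an assumption, as are the function names, the Euclidean distance, the sample data, and the simplification of taking exactly K neighbors when distances tie.

```python
import math

def k_nearest(points, i, k):
    """Return [(distance, index), ...] for the k nearest neighbors of points[i]."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return dists[:k]

def local_outlier_factor(points, k):
    """Textbook LOF; assumes no duplicate points (they would give zero distances)."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must be between 1 and len(points) - 1")
    neighbors = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbor.
    k_distance = [nbrs[-1][0] for nbrs in neighbors]
    # Local reachability density: inverse of the mean reachability distance,
    # where reach(i, j) = max(k_distance[j], dist(i, j)).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbors divided by the point's own density.
    lof = []
    for i in range(n):
        mean_neighbor_lrd = sum(lrd[j] for _, j in neighbors[i]) / k
        lof.append(mean_neighbor_lrd / lrd[i])
    return lof

# Hypothetical data: four points on a unit square plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
lof_scores = local_outlier_factor(points, k=2)
```

A value close to 1 means a point is about as dense as its neighbors; values well above 1 mean the point lies in a sparser region than its neighbors do.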

The other density-based method that outlier detection uses is the local distance-based outlier factor (ldof). Ldof is a ratio of two measures: the first is the average distance of the data point to its K nearest neighbors; the second is the average of the pairwise distances of the neighbors themselves. Again, the higher the value, the more the data point is an outlier.
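The ratio described above can be sketched as follows. This is an illustrative brute-force version, assuming Euclidean distance and K of at least 2 (so that the neighbors have at least one pair); the function name and sample data are hypothetical.

```python
import math
from itertools import combinations

def ldof(points, k):
    """Return the local distance-based outlier factor of each point."""
    if not 2 <= k < len(points):
        raise ValueError("k must be between 2 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        # The k nearest neighbors of p as (distance, index) pairs.
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        # First measure: average distance from p to its k nearest neighbors.
        knn_dist = sum(d for d, _ in nearest) / k
        # Second measure: average pairwise distance among those neighbors.
        # Assumes the neighbors are not all identical (non-zero denominator).
        pairs = list(combinations([j for _, j in nearest], 2))
        inner_dist = sum(math.dist(points[a], points[b]) for a, b in pairs) / len(pairs)
        scores.append(knn_dist / inner_dist)
    return scores

# Hypothetical data: four points on a unit square plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
ldof_scores = ldof(points, k=2)
```

A point surrounded by its neighbors gets a ratio below or around 1, while a point far from a tight group of neighbors gets a ratio well above 1.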

As you can see, these four algorithms work differently, so they don’t always agree on which points are outliers. By default, we use all these methods during outlier detection, then normalize and combine their results and give every data point in the index an outlier score. The outlier score ranges from 0 to 1, where a higher number represents a higher chance that the data point is an outlier compared to the other data points in the index. Outlier detection is a batch analysis: it runs against your data once. If new data comes into the index, you need to run the analysis again on the altered data.
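The exact normalization and combination steps are not described here. As a rough illustration only, the sketch below assumes min-max normalization per method followed by a plain average; both choices, the function names, and the raw score values are hypothetical and are not the Elastic Stack's actual procedure.

```python
def min_max_normalize(scores):
    """Rescale scores linearly to the range 0..1 (assumed normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All points look alike to this method, so none stands out.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def combine_scores(raw_scores_by_method):
    """Average the normalized scores of all methods, point by point."""
    normalized = [min_max_normalize(s) for s in raw_scores_by_method.values()]
    n_methods = len(normalized)
    # zip(*normalized) groups the scores that belong to the same data point.
    return [sum(per_point) / n_methods for per_point in zip(*normalized)]

# Hypothetical raw scores for three data points; note that the methods
# disagree about how unusual the second point is.
raw = {
    "distance_kth_nn": [1.0, 2.0, 5.0],
    "distance_knn": [1.0, 3.0, 5.0],
    "lof": [1.0, 1.0, 3.0],
    "ldof": [0.5, 1.0, 2.5],
}
outlier_scores = combine_scores(raw)
# outlier_scores == [0.0, 0.25, 1.0]
```

Normalizing first matters because the raw scores of the four methods are on different scales; without it, one method could dominate the combined result.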

Feature influence

Besides the outlier score, another value is calculated during outlier detection: the feature influence score. As we mentioned, there are multiple features of a data point that are analyzed during outlier detection. An influential feature is a feature of a data point that is responsible for the point being an outlier. The value of feature influence provides a relative ranking of features by their contribution to a point being an outlier. Therefore, while the outlier score tells us whether a data point is an outlier, feature influence shows which features make the point an outlier. This context helps explain why the data point is unusual and can drive visualizations.